Housing is an important tool in any investor's toolkit. However, with the multiplicity of factors involved in choosing a house, it is often difficult to pinpoint those that most influence its sale price.
To help investors and homeowners in Ames pick the best house and the most worthwhile upgrades to optimise sale price, we will build a model that predicts the sale price based on various aspects of a house, from floor area to roofing type, before identifying some of the most important factors that help owners maximise the sale value of their house.
Role taken: Advisor to potential property investors and homeowners
Ames is a city located in Story County, Iowa, USA. With a population of 66,427 in 2020, it is only the 6th largest city in Iowa and is considered a small city. Ames has a booming economy: in 2015 it was ranked one of the top 15 "Cities That Have Done the Best Since the Recession" by Bloomberg Businessweek (Source), and it has one of the lowest unemployment rates in the United States.
# Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import math
import statsmodels.api as sm
from scipy import special, stats
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
import matplotlib.pylab as pylab
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor
sns.set_theme(style = 'white')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
pd.set_option("display.max_columns", None)
# Visualization (theme already set above)
params = {'legend.fontsize': 'x-large',
          'axes.labelsize': 'large',
          'axes.titlesize': 'xx-large',
          'xtick.labelsize': 'medium',
          'ytick.labelsize': 'medium'}
pylab.rcParams.update(params)
# Some reusable functions
# Plot barplot for missing values
def plot_missing(df, plot_type="count"):
    # barplot for features with more than 0 missing values, in counts
    if plot_type == 'count':
        missing_count = df.isnull().sum().sort_values(ascending=False)
        plt.figure(figsize=(20, 6))
        missing_val_bar = plt.bar(x=missing_count[missing_count > 0].index,
                                  height=missing_count[missing_count > 0],
                                  color='grey')
        plt.title("Missing values count")
        plt.xlabel("Feature")
        plt.ylabel("Count of missing values")
        plt.bar_label(missing_val_bar)
        plt.xticks(rotation=90)
    # barplot for features with more than 0 missing values, as a ratio
    elif plot_type == 'ratio':
        missing_ratio = (df.isnull().sum() / df.shape[0]).sort_values(ascending=False).round(4)
        plt.figure(figsize=(20, 6))
        missing_val_bar = plt.bar(x=missing_ratio[missing_ratio > 0].index,
                                  height=missing_ratio[missing_ratio > 0],
                                  color='darkgrey')
        plt.title("Missing values ratio")
        plt.xlabel("Feature")
        plt.ylabel("Ratio of missing values")
        plt.bar_label(missing_val_bar)
        plt.xticks(rotation=90)
# Function to plot scatter plots of several features against another feature
def plot_scatterplot(df, predictor_features, response_feature):
    num_plots = len(predictor_features)
    num_cols = 5
    num_rows = math.ceil(num_plots / num_cols)
    # squeeze=False keeps axs 2-D so axs[row, col] works even with a single row
    fig, axs = plt.subplots(num_rows, num_cols, figsize=(20, num_rows * (20 / num_cols - 0.5)), squeeze=False)
    # loop through rows and columns and plot scatter plots
    predictor_count = 0
    for row in range(num_rows):
        for col in range(num_cols):
            if predictor_count < len(predictor_features):
                sns.scatterplot(data=df, x=predictor_features[predictor_count], y=response_feature, ax=axs[row, col])
                predictor_count += 1
            else:
                break
    plt.tight_layout()
# Function to plot scatter plots between pairs of features
def plot_scatterplot_pairs(df, feature_pairs):
    num_plots = len(feature_pairs)
    num_cols = 5
    num_rows = math.ceil(num_plots / num_cols)
    # squeeze=False keeps axs 2-D so axs[row, col] works even with a single row
    fig, axs = plt.subplots(num_rows, num_cols, figsize=(20, num_rows * (20 / num_cols - 0.5)), squeeze=False)
    # loop through rows and columns and plot scatter plots
    predictor_count = 0
    for row in range(num_rows):
        for col in range(num_cols):
            # note: the condition must be < len(feature_pairs), not len(feature_pairs)-1,
            # otherwise the last pair is never plotted
            if predictor_count < len(feature_pairs):
                sns.scatterplot(data=df, x=feature_pairs[predictor_count][0], y=feature_pairs[predictor_count][1], ax=axs[row, col])
                predictor_count += 1
            else:
                break
    plt.tight_layout()
# Function to plot boxplots
def plot_boxplot(df, features, response, height_multiple):
    """Plot a series of boxplots.
    df: dataframe to plot
    features: list of feature names for the x-axis
    response: name of the response feature for the y-axis
    height_multiple: multiplier controlling the height of each row of plots"""
    num_plots = len(features)
    num_cols = 2
    num_rows = math.ceil(num_plots / num_cols)
    # squeeze=False keeps axs 2-D so axs[row, col] works even with a single row
    fig, axs = plt.subplots(num_rows, num_cols, figsize=(20, num_rows * (20 / num_cols * height_multiple)), squeeze=False)
    # loop through rows and columns and plot boxplots
    predictor_count = 0
    for row in range(num_rows):
        for col in range(num_cols):
            if predictor_count < len(features):
                sns.boxplot(data=df, x=features[predictor_count], y=response, ax=axs[row, col])
                predictor_count += 1
            else:
                break
    plt.tight_layout()
# Function which returns a dataframe of the percentage of the most frequent value of each feature
def single_value_percentage(df):
    percentage_dict = {}
    for col in df.columns:
        percentage_dict[col] = df[col].value_counts().iloc[0] / df[col].shape[0]
    return pd.DataFrame(percentage_dict, index=["Percentage"]).T.sort_values(by="Percentage", ascending=False)
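A quick standalone sanity check of this helper on a small made-up frame (the helper is re-defined here so the cell runs on its own; the toy columns are purely illustrative):

```python
import pandas as pd

# Standalone copy of the helper, so this demo cell is self-contained
def single_value_percentage(df):
    percentage_dict = {}
    for col in df.columns:
        percentage_dict[col] = df[col].value_counts().iloc[0] / df[col].shape[0]
    return pd.DataFrame(percentage_dict, index=["Percentage"]).T.sort_values(
        by="Percentage", ascending=False)

toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Pave", "Grvl"],   # dominant value covers 75% of rows
    "Central Air": ["Y", "Y", "N", "N"],          # dominant value covers 50% of rows
})
print(single_value_percentage(toy))
```

As expected, the feature with the more dominant single value ("Street") sorts to the top.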
# Add a list to each of a list of lists
def add_list_to_lists(original_lists, list_to_add):
    """Add a list of items to each of the other lists, removing duplicated items.
    Note: the set() used for deduplication does not preserve element order.
    original_lists: list of lists to add to
    list_to_add: new list of items to add to the original lists
    Returns: tuple of updated lists"""
    return tuple(list(set(original_list + list_to_add)) for original_list in original_lists)
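A minimal usage example of this helper (re-defined here so the cell is self-contained; the feature names are just placeholders). Note the deduplication and the fact that `set()` discards order, which is why we sort before printing:

```python
# Standalone copy of the helper, so this demo cell is self-contained
def add_list_to_lists(original_lists, list_to_add):
    return tuple(list(set(original_list + list_to_add)) for original_list in original_lists)

a, b = add_list_to_lists([["Id", "PID"], ["Id"]], ["Alley", "Id"])
print(sorted(a))  # ['Alley', 'Id', 'PID'] -- "Id" is not duplicated
print(sorted(b))  # ['Alley', 'Id']
```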
# Function to select features which satisfy both conditions: a weak correlation with "SalePrice"
# and a high percentage of a single dominant value
def numeric_features_to_drop_by_threshold(numeric_df, corr_threshold, dominant_feature_threshold):
    numeric_corr = numeric_df.corr()
    weak_corr = numeric_corr[(numeric_corr >= -corr_threshold) & (numeric_corr <= corr_threshold)]["SalePrice"].dropna(how='all').index
    percentages = single_value_percentage(numeric_df)
    high_single_value_list = percentages[percentages["Percentage"] > dominant_feature_threshold].index
    # Index & Index gives the intersection of the two sets of feature names
    weak_corr_single_val = list(high_single_value_list & weak_corr)
    return weak_corr_single_val
# Function to do min-max normalization within each column
def normalize_values_columns(df, cols_to_normalize):
    for col in cols_to_normalize:
        df[f"{col} (normalized)"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return df
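A quick standalone check of the min-max normalization on a tiny made-up frame (the helper is repeated here so the cell runs on its own):

```python
import pandas as pd

# Standalone copy of the helper, so this demo cell is self-contained
def normalize_values_columns(df, cols_to_normalize):
    for col in cols_to_normalize:
        df[f"{col} (normalized)"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return df

toy = pd.DataFrame({"Lot Area": [5000, 10000, 15000]})
toy = normalize_values_columns(toy, ["Lot Area"])
print(toy["Lot Area (normalized)"].tolist())  # [0.0, 0.5, 1.0]
```

The minimum maps to 0, the maximum to 1, and everything else falls linearly in between.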
# Train
df_original_train = pd.read_csv("../datasets/train.csv")
print(df_original_train.shape)
(2051, 81)
# Test
df_original_test = pd.read_csv("../datasets/test.csv")
print(df_original_test.shape)
df_original_test = df_original_test.sort_values(by = 'Id') # sort by Id so that it can be used for submission
(878, 80)
# Example submission
df_original_ex = pd.read_csv("../datasets/sample_sub_reg.csv")
df_original_ex.shape
(878, 2)
df_original_train.head()
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | RL | NaN | 13517 | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | Sawyer | RRAe | Norm | 1Fam | 2Story | 6 | 8 | 1976 | 2005 | Gable | CompShg | HdBoard | Plywood | BrkFace | 289.0 | Gd | TA | CBlock | TA | TA | No | GLQ | 533.0 | Unf | 0.0 | 192.0 | 725.0 | GasA | Ex | Y | SBrkr | 725 | 754 | 0 | 1479 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | NaN | Attchd | 1976.0 | RFn | 2.0 | 475.0 | TA | TA | Y | 0 | 44 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | WD | 130500 |
| 1 | 544 | 531379050 | 60 | RL | 43.0 | 11492 | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | SawyerW | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1996 | 1997 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 132.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 637.0 | Unf | 0.0 | 276.0 | 913.0 | GasA | Ex | Y | SBrkr | 913 | 1209 | 0 | 2122 | 1.0 | 0.0 | 2 | 1 | 4 | 1 | Gd | 8 | Typ | 1 | TA | Attchd | 1997.0 | RFn | 2.0 | 559.0 | TA | TA | Y | 0 | 74 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2009 | WD | 220000 |
| 2 | 153 | 535304180 | 20 | RL | 68.0 | 7922 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1953 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | Gd | CBlock | TA | TA | No | GLQ | 731.0 | Unf | 0.0 | 326.0 | 1057.0 | GasA | TA | Y | SBrkr | 1057 | 0 | 0 | 1057 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | Gd | 5 | Typ | 0 | NaN | Detchd | 1953.0 | Unf | 1.0 | 246.0 | TA | TA | Y | 0 | 52 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | 109000 |
| 3 | 318 | 916386060 | 60 | RL | 73.0 | 9802 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Timber | Norm | Norm | 1Fam | 2Story | 5 | 5 | 2006 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | TA | PConc | Gd | TA | No | Unf | 0.0 | Unf | 0.0 | 384.0 | 384.0 | GasA | Gd | Y | SBrkr | 744 | 700 | 0 | 1444 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 0 | NaN | BuiltIn | 2007.0 | Fin | 2.0 | 400.0 | TA | TA | Y | 100 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | 174000 |
| 4 | 255 | 906425045 | 50 | RL | 82.0 | 14235 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | SawyerW | Norm | Norm | 1Fam | 1.5Fin | 6 | 8 | 1900 | 1993 | Gable | CompShg | Wd Sdng | Plywood | None | 0.0 | TA | TA | PConc | Fa | Gd | No | Unf | 0.0 | Unf | 0.0 | 676.0 | 676.0 | GasA | TA | Y | SBrkr | 831 | 614 | 0 | 1445 | 0.0 | 0.0 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NaN | Detchd | 1957.0 | Unf | 2.0 | 484.0 | TA | TA | N | 0 | 59 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | WD | 138500 |
df_original_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Id               2051 non-null   int64
 1   PID              2051 non-null   int64
 2   MS SubClass      2051 non-null   int64
 3   MS Zoning        2051 non-null   object
 4   Lot Frontage     1721 non-null   float64
 5   Lot Area         2051 non-null   int64
 6   Street           2051 non-null   object
 7   Alley            140 non-null    object
 8   Lot Shape        2051 non-null   object
 9   Land Contour     2051 non-null   object
 10  Utilities        2051 non-null   object
 11  Lot Config       2051 non-null   object
 12  Land Slope       2051 non-null   object
 13  Neighborhood     2051 non-null   object
 14  Condition 1      2051 non-null   object
 15  Condition 2      2051 non-null   object
 16  Bldg Type        2051 non-null   object
 17  House Style      2051 non-null   object
 18  Overall Qual     2051 non-null   int64
 19  Overall Cond     2051 non-null   int64
 20  Year Built       2051 non-null   int64
 21  Year Remod/Add   2051 non-null   int64
 22  Roof Style       2051 non-null   object
 23  Roof Matl        2051 non-null   object
 24  Exterior 1st     2051 non-null   object
 25  Exterior 2nd     2051 non-null   object
 26  Mas Vnr Type     2029 non-null   object
 27  Mas Vnr Area     2029 non-null   float64
 28  Exter Qual       2051 non-null   object
 29  Exter Cond       2051 non-null   object
 30  Foundation       2051 non-null   object
 31  Bsmt Qual        1996 non-null   object
 32  Bsmt Cond        1996 non-null   object
 33  Bsmt Exposure    1993 non-null   object
 34  BsmtFin Type 1   1996 non-null   object
 35  BsmtFin SF 1     2050 non-null   float64
 36  BsmtFin Type 2   1995 non-null   object
 37  BsmtFin SF 2     2050 non-null   float64
 38  Bsmt Unf SF      2050 non-null   float64
 39  Total Bsmt SF    2050 non-null   float64
 40  Heating          2051 non-null   object
 41  Heating QC       2051 non-null   object
 42  Central Air      2051 non-null   object
 43  Electrical       2051 non-null   object
 44  1st Flr SF       2051 non-null   int64
 45  2nd Flr SF       2051 non-null   int64
 46  Low Qual Fin SF  2051 non-null   int64
 47  Gr Liv Area      2051 non-null   int64
 48  Bsmt Full Bath   2049 non-null   float64
 49  Bsmt Half Bath   2049 non-null   float64
 50  Full Bath        2051 non-null   int64
 51  Half Bath        2051 non-null   int64
 52  Bedroom AbvGr    2051 non-null   int64
 53  Kitchen AbvGr    2051 non-null   int64
 54  Kitchen Qual     2051 non-null   object
 55  TotRms AbvGrd    2051 non-null   int64
 56  Functional       2051 non-null   object
 57  Fireplaces       2051 non-null   int64
 58  Fireplace Qu     1051 non-null   object
 59  Garage Type      1938 non-null   object
 60  Garage Yr Blt    1937 non-null   float64
 61  Garage Finish    1937 non-null   object
 62  Garage Cars      2050 non-null   float64
 63  Garage Area      2050 non-null   float64
 64  Garage Qual      1937 non-null   object
 65  Garage Cond      1937 non-null   object
 66  Paved Drive      2051 non-null   object
 67  Wood Deck SF     2051 non-null   int64
 68  Open Porch SF    2051 non-null   int64
 69  Enclosed Porch   2051 non-null   int64
 70  3Ssn Porch       2051 non-null   int64
 71  Screen Porch     2051 non-null   int64
 72  Pool Area        2051 non-null   int64
 73  Pool QC          9 non-null      object
 74  Fence            400 non-null    object
 75  Misc Feature     65 non-null     object
 76  Misc Val         2051 non-null   int64
 77  Mo Sold          2051 non-null   int64
 78  Yr Sold          2051 non-null   int64
 79  Sale Type        2051 non-null   object
 80  SalePrice        2051 non-null   int64
dtypes: float64(11), int64(28), object(42)
memory usage: 1.3+ MB
The data dictionary consists of Feature Label, Description and Categories.
For the purpose of our analysis, we have also added Aspect of house, the particular component of the house that the feature belongs to, and Data Type, which tells us whether the feature is nominal, ordinal or numeric.
| Feature Label | Description | Categories | Aspect of house | Data Type |
|---|---|---|---|---|
| Id | Unique ID | None | Administrative | None |
| PID | Parcel Identification Number assigned to each property within the Ames city limits | None | Unrelated | None |
| MS SubClass | The building class | 20: 1-STORY 1946 & NEWER ALL STYLES / 30: 1-STORY 1945 & OLDER / 40: 1-STORY W/FINISHED ATTIC ALL AGES / 45: 1-1/2 STORY - UNFINISHED ALL AGES / 50: 1-1/2 STORY FINISHED ALL AGES / 60: 2-STORY 1946 & NEWER / 70: 2-STORY 1945 & OLDER / 75: 2-1/2 STORY ALL AGES / 80: SPLIT OR MULTI-LEVEL / 85: SPLIT FOYER / 90: DUPLEX - ALL STYLES AND AGES / 120: 1-STORY PUD (Planned Unit Development) - 1946 & NEWER / 150: 1-1/2 STORY PUD - ALL AGES / 160: 2-STORY PUD - 1946 & NEWER / 180: PUD - MULTILEVEL - INCL SPLIT LEV/FOYER / 190: 2 FAMILY CONVERSION - ALL STYLES AND AGES | Type | Nominal |
| MS Zoning | Identifies the general zoning classification of the sale | A: Agriculture / C: Commercial / FV: Floating Village Residential / I: Industrial / RH: Residential High Density / RL: Residential Low Density / RP: Residential Low Density Park / RM: Residential Medium Density | Administrative | Nominal |
| Lot Frontage | Linear feet of street connected to property | None | Environment | Numeric |
| Lot Area | Lot size in square feet | None | Land | Numeric |
| Street | Type of road access to property | Grvl: Gravel / Pave: Paved | Environment | Nominal |
| Alley | Type of alley access to property | Grvl: Gravel / Pave: Paved / NA: No alley access | Environment | Nominal |
| Lot Shape | General shape of property | Reg: Regular / IR1: Slightly irregular / IR2: Moderately Irregular / IR3: Irregular | Land | Nominal |
| Land Contour | Flatness of the property | Lvl: Near Flat/Level / Bnk: Banked - Quick and significant rise from street grade to building / HLS: Hillside - Significant slope from side to side / Low: Depression | Land | Nominal |
| Utilities | Type of utilities available | AllPub: All public Utilities (E,G,W,& S) / NoSewr: Electricity, Gas, and Water (Septic Tank) / NoSeWa: Electricity and Gas Only / ELO: Electricity only | Utility | Nominal |
| Lot Config | Lot configuration | Inside: Inside lot / Corner: Corner lot / CulDSac: Cul-de-sac / FR2: Frontage on 2 sides of property / FR3: Frontage on 3 sides of property | Environment | Nominal |
| Land Slope | Slope of property | Gtl: Gentle slope / Mod: Moderate Slope / Sev: Severe Slope | Land | Nominal |
| Neighborhood | Physical locations within Ames city limit | Value represents unique neighbourhoods in Ames | Environment | Nominal |
| Condition 1 | Proximity to main road or railroad | Artery: Adjacent to arterial street / Feedr: Adjacent to feeder street / Norm: Normal / RRNn: Within 200' of North-South Railroad / RRAn: Adjacent to North-South Railroad / PosN: Near positive off-site feature--park, greenbelt, etc. / PosA: Adjacent to postive off-site feature / RRNe: Within 200' of East-West Railroad / RRAe: Adjacent to East-West Railroad | Environment | Nominal |
| Condition 2 | Proximity to main road or railroad (if a second is present) | Same as Condition 1 | Environment | Nominal |
| Bldg Type | Type of dwelling | 1Fam: Single-family Detached / 2FmCon: Two-family Conversion; originally built as one-family dwelling / Duplx: Duplex / TwnhsE: Townhouse End Unit / TwnhsI: Townhouse Inside Unit | Type | Nominal |
| House Style | Style of dwelling | 1Story: One story / 1.5Fin: One and one-half story: 2nd level finished / 1.5Unf: One and one-half story: 2nd level unfinished / 2Story: Two story / 2.5Fin: Two and one-half story: 2nd level finished / 2.5Unf: Two and one-half story: 2nd level unfinished / SFoyer: Split Foyer / SLvl: Split Level | Overall | Nominal |
| Overall Qual | Overall material and finish quality | 10 Very Excellent / 9 Excellent / 8 Very Good / 7 Good / 6 Above Average / 5 Average / 4 Below Average / 3 Fair / 2 Poor / 1 Very Poor | Overall | Ordinal |
| Overall Cond | Overall condition rating | Same as Overall Qual | Overall | Ordinal |
| Year Built | Original construction date | None | Age | Numeric |
| Year Remod/Add | Remodel date (same as construction date if no remodeling or additions) | None | Age | Numeric |
| Roof Style | Type of roof | Flat: Flat / Gable: Gable / Gambrel: Gabrel (Barn) / Hip: Hip / Mansard: Mansard / Shed: Shed | Roof | Nominal |
| Roof Matl | Roof material | ClyTile: Clay or Tile / CompShg: Standard (Composite) Shingle / Membran: Membrane / Metal: Metal / Roll: Roll / Tar&Grv: Gravel & Tar / WdShake: Wood Shakes / WdShngl: Wood Shingles | Roof | Nominal |
| Exterior 1st | Exterior covering on house | AsbShng: Asbestos Shingles / AsphShn: Asphalt Shingles / BrkComm: Brick Common / BrkFace: Brick Face / CBlock: Cinder Block / CemntBd: Cement Board / HdBoard: Hard Board / ImStucc: Imitation Stucco / MetalSd: Metal Siding / Other: Other / Plywood:Plywood / PreCast: PreCast / Stone: Stone / Stucco: Stucco / VinylSd: Vinyl Siding / Wd Sdng: Wood Siding / WdShing: Wood Shingles | Facade | Nominal |
| Exterior 2nd | Exterior covering on house (if more than one material) | Same as Exterior 1st | Facade | Nominal |
| Mas Vnr Type | Masonry veneer type | BrkCmn: Brick Common / BrkFace: Brick Face / Cblock: Cinder Block / None: None / Stone: Stone | Facade | Nominal |
| Mas Vnr Area | Masonry veneer area in square feet | None | Facade | Numeric |
| Exter Qual | Exterior material quality | Ex: Excellent / Gd: Good / TA: Average/Typical / Fa: Fair / Po: Poor | Facade | Ordinal |
| Exter Cond | Present condition of the material on the exterior | Same as Exter Qual | Facade | Ordinal |
| Foundation | Type of foundation | BrkTil: Brick & Tile / Cblock: Cinder Block / Pconc: Poured Contrete / Slab: Slab / Stone: Stone / Wood: Wood | Foundation | Nominal |
| Bsmt Qual | Height of the basement | Ex: Excellent (100+ inches) / Gd: Good (90-99 inches) / TA: Typical (80-89 inches) / Fa: Fair (70-79 inches) / Po: Poor (<70 inches) / NA: No Basement | Basement | Ordinal |
| Bsmt Cond | General condition of the basement | Ex: Excellent / Gd: Good / TA: Typical - slight dampness allowed / Fa: Fair - dampness or some cracking or settling / Po: Poor - Severe cracking, settling, or wetness / NA: No Basement | Basement | Ordinal |
| Bsmt Exposure | Walkout or garden level basement walls | Gd: Good Exposure / Av: Average Exposure (split levels or foyers typically score average or above) / Mn: Mimimum Exposure / No: No Exposure / NA: No Basement | Basement | Ordinal |
| BsmtFin Type 1 | Quality of basement finished area | GLQ: Good Living Quarters / ALQ: Average Living Quarters / BLQ: Below Average Living Quarters / Rec: Average Rec Room / LwQ: Low Quality / Unf: Unfinshed / NA: No Basement | Basement | Ordinal |
| BsmtFin SF 1 | Type 1 finished square feet | None | Basement | Numeric |
| BsmtFin Type 2 | Quality of second finished area (if present) | Same as BsmtFin Type 1 | Basement | Ordinal |
| BsmtFin SF 2 | Type 2 finished square feet | None | Basement | Numeric |
| Bsmt Unf SF | Unfinished square feet of basement area | None | Basement | Numeric |
| Total Bsmt SF | Total square feet of basement area | None | Basement | Numeric |
| Heating | Type of heating | Floor: Floor Furnace / GasA: Gas forced warm air furnace / GasW: Gas hot water or steam heat / Grav: Gravity furnace / OthW: Hot water or steam heat other than gas / Wall: Wall furnace | Utility | Nominal |
| Heating QC | Heating quality and condition | Ex: Excellent / Gd: Good / TA: Average/Typical / Fa: Fair / Po: Poor | Utility | Ordinal |
| Central Air | Central air conditioning | N: No / Y: Yes | Utility | Ordinal |
| Electrical | Electrical system | SBrkr: Standard Circuit Breakers & Romex / FuseA: Fuse Box over 60 AMP and all Romex wiring (Average) / FuseF: 60 AMP Fuse Box and mostly Romex wiring (Fair) / FuseP: 60 AMP Fuse Box and mostly knob & tube wiring (poor) / Mix: Mixed | Utility | Ordinal |
| 1st Flr SF | First Floor square feet | None | Above Grade | Numeric |
| 2nd Flr SF | Second floor square feet | None | Second Floor | Numeric |
| Low Qual Fin SF | Low quality finished square feet (all floors) | None | Above Grade | Numeric |
| Gr Liv Area | Above grade (ground) living area square feet | None | Overall | Numeric |
| Bsmt Full Bath | Basement full bathrooms | None | Bathroom | Numeric |
| Bsmt Half Bath | Basement half bathrooms | None | Bathroom | Numeric |
| Full Bath | Full bathrooms above grade | None | Bathroom | Numeric |
| Half Bath | Half baths above grade | None | Bathroom | Numeric |
| Bedroom AbvGr | Number of bedrooms above basement level | None | Bedroom | Numeric |
| Kitchen AbvGr | Number of kitchens | None | Kitchen | Numeric |
| Kitchen Qual | Kitchen quality | Ex:Excellent / Gd: Good / TA: Typical/Average / Fa: Fair / Po: Poor | Kitchen | Ordinal |
| TotRms AbvGrd | Total rooms above grade (does not include bathrooms) | None | Above Grade | Numeric |
| Functional | Home functionality rating | Typ: Typical Functionality / Min1: Minor Deductions 1 / Min2: Minor Deductions 2 / Mod: Moderate Deductions / Maj1: Major Deductions 1 / Maj2: Major Deductions 2 / Sev: Severely Damaged / Sal: Salvage only | Overall | Ordinal |
| Fireplaces | Number of fireplaces | None | Utility | Numeric |
| Fireplace Qu | Fireplace quality | Ex: Excellent - Exceptional Masonry Fireplace / Gd: Good - Masonry Fireplace in main level / TA: Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement / Fa: Fair - Prefabricated Fireplace in basement / Po: Poor - Ben Franklin Stove / NA: No Fireplace | Utility | Ordinal |
| Garage Type | Garage location | 2Types: More than one type of garage / Attchd: Attached to home / Basment: Basement Garage / BuiltIn: Built-In (Garage part of house - typically has room above garage) / CarPort: Car Port / Detchd: Detached from home / NA: No Garage | Garage | Nominal |
| Garage Yr Blt | Year garage was built | None | Age | Numeric |
| Garage Finish | Interior finish of the garage | Fin: Finished / RFn: Rough Finished / Unf: Unfinished / NA: No Garage | Garage | Ordinal |
| Garage Cars | Size of garage in car capacity | None | Garage | Numeric |
| Garage Area | Size of garage in square feet | None | Garage | Numeric |
| Garage Qual | Garage quality | Ex: Excellent / Gd: Good / TA: Typical/Average / Fa: Fair / Po: Poor / NA: No Garage | Garage | Ordinal |
| Garage Cond | Garage condition | Same as Garage Qual | Garage | Ordinal |
| Paved Drive | Paved driveway | Y: Paved / P: Partial Pavement / N: Dirt/Gravel | Driveway | Ordinal |
| Wood Deck SF | Wood deck area in square feet | None | Porch/Garden | Numeric |
| Open Porch SF | Open porch area in square feet | None | Porch/Garden | Numeric |
| Enclosed Porch | Enclosed porch area in square feet | None | Porch/Garden | Numeric |
| 3Ssn Porch | Three season porch area in square feet | None | Porch/Garden | Numeric |
| Screen Porch | Screen porch area in square feet | None | Porch/Garden | Numeric |
| Pool Area | Pool area in square feet | None | Porch/Garden | Numeric |
| Pool QC | Pool quality | Ex: Excellent / Gd: Good / TA: Average/Typical / Fa: Fair / NA: No Pool | Porch/Garden | Ordinal |
| Fence | Fence quality | GdPrv: Good Privacy / MnPrv: Minimum Privacy / GdWo: Good Wood / MnWw: Minimum Wood/Wire / NA: No Fence | Porch/Garden | Nominal |
| Misc Feature | Miscellaneous feature not covered in other categories | Elev: Elevator / Gar2: 2nd Garage (if not described in garage section) / Othr: Other / Shed: Shed (over 100 SF) / TenC :Tennis Court / NA: None | Overall | Nominal |
| Misc Val | $Value of miscellaneous feature | None | Overall | Numeric |
| Mo Sold | Month Sold | None | Age | Nominal |
| Yr Sold | Year Sold | None | Age | Numeric |
| Sale Type | Type of sale | WD: Warranty Deed - Conventional / CWD: Warranty Deed - Cash / VWD: Warranty Deed - VA Loan / New: Home just constructed and sold / COD: Court Officer Deed/Estate / Con: Contract 15% Down payment regular terms / ConLw: Contract Low Down payment and low interest / ConLI: Contract Low Interest / ConLD: Contract Low Down / Oth: Other | Administrative | Nominal |
| SalePrice | The property's sale price in dollars. The response variable. | None | Price | Numeric |
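Many of the ordinal features in the dictionary share the same quality scale (Ex/Gd/TA/Fa/Po, with NA for absence). A minimal sketch of how such a scale could be encoded as integers for modelling; the specific mapping below is our own choice, not part of the dataset:

```python
import pandas as pd

# Hypothetical integer encoding of the shared quality scale; NA (no feature) maps to 0
quality_scale = {"NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Toy example: encode a column like "Bsmt Qual", treating missing as NA first
toy = pd.Series(["Gd", "TA", None, "Ex"])
encoded = toy.fillna("NA").map(quality_scale)
print(encoded.tolist())  # [4, 3, 0, 5]
```

Encoding the scale this way preserves the ordering information that one-hot encoding would throw away.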
To ensure the train and test sets share the same structure, we will merge them into a single dataframe, making it easier to perform the data cleaning and feature engineering steps.
# First we add a column to label if it is from train or test set, so that it is easier to identify train and test sets
df_original_train['Train/Test'] = "Train"
df_original_test['Train/Test'] = "Test"
df_original_test['SalePrice'] = np.nan
# Then we concat both into a single dataframe
# Note that "SalePrice" is absent from the test set, which is why we filled it with NaN above
df_original_combined = pd.concat([df_original_train, df_original_test])
print(df_original_combined.shape)
df_original_combined.tail()
(2929, 82)
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 536 | 2919 | 923228210 | 160 | RM | 21.0 | 1526 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | Norm | Twnhs | 2Story | 4 | 5 | 1970 | 1970 | Gable | CompShg | CemntBd | CmentBd | None | 0.0 | TA | TA | CBlock | TA | TA | No | Unf | 0.0 | Unf | 0.0 | 546.0 | 546.0 | GasA | TA | Y | SBrkr | 546 | 546 | 0 | 1092 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | TA | 5 | Typ | 0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | Y | 0 | 34 | 0 | 0 | 0 | 0 | NaN | GdPrv | NaN | 0 | 6 | 2006 | WD | NaN | Test |
| 178 | 2921 | 923228310 | 160 | RM | 21.0 | 1894 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | Norm | TwnhsE | 2Story | 4 | 5 | 1970 | 1970 | Gable | CompShg | CemntBd | CmentBd | None | 0.0 | TA | TA | CBlock | TA | TA | No | Rec | 252.0 | Unf | 0.0 | 294.0 | 546.0 | GasA | TA | Y | SBrkr | 546 | 546 | 0 | 1092 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | TA | 6 | Typ | 0 | NaN | CarPort | 1970.0 | Unf | 1.0 | 286.0 | TA | TA | Y | 0 | 24 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2006 | WD | NaN | Test |
| 828 | 2922 | 923229110 | 90 | RL | 55.0 | 12640 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | Duplex | 1Story | 6 | 5 | 1976 | 1976 | Gable | CompShg | Plywood | Plywood | None | 0.0 | TA | TA | CBlock | TA | TA | Gd | Rec | 936.0 | LwQ | 396.0 | 396.0 | 1728.0 | GasA | TA | Y | SBrkr | 1728 | 0 | 0 | 1728 | 0.0 | 0.0 | 2 | 0 | 4 | 2 | TA | 8 | Typ | 0 | NaN | Attchd | 1976.0 | Unf | 2.0 | 574.0 | TA | TA | Y | 40 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2006 | WD | NaN | Test |
| 668 | 2925 | 923251180 | 20 | RL | 160.0 | 20000 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1960 | 1996 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | TA | CBlock | TA | TA | No | ALQ | 1224.0 | Unf | 0.0 | 0.0 | 1224.0 | GasA | Ex | Y | SBrkr | 1224 | 0 | 0 | 1224 | 1.0 | 0.0 | 1 | 0 | 4 | 1 | TA | 7 | Typ | 1 | TA | Detchd | 1960.0 | Unf | 2.0 | 576.0 | TA | TA | Y | 474 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2006 | WD | NaN | Test |
| 674 | 2928 | 923400125 | 85 | RL | 62.0 | 10441 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | SFoyer | 5 | 5 | 1992 | 1992 | Gable | CompShg | HdBoard | Wd Shng | None | 0.0 | TA | TA | PConc | Gd | TA | Av | GLQ | 337.0 | Unf | 0.0 | 575.0 | 912.0 | GasA | TA | Y | SBrkr | 970 | 0 | 0 | 970 | 0.0 | 1.0 | 1 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | Y | 80 | 32 | 0 | 0 | 0 | 0 | NaN | MnPrv | Shed | 700 | 7 | 2006 | WD | NaN | Test |
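Because every row carries the Train/Test label, the combined frame can be split back losslessly after cleaning. A minimal sketch with toy frames standing in for df_original_train and df_original_test:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_original_train / df_original_test
train = pd.DataFrame({"Id": [1, 2], "SalePrice": [100, 200], "Train/Test": "Train"})
test = pd.DataFrame({"Id": [3], "SalePrice": np.nan, "Train/Test": "Test"})
combined = pd.concat([train, test])

# After cleaning, filter on the label column to recover the two sets
back_to_train = combined[combined["Train/Test"] == "Train"]
back_to_test = combined[combined["Train/Test"] == "Test"]
print(back_to_train.shape, back_to_test.shape)  # (2, 3) (1, 3)
```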
Let us visualize the missing data
plot_missing(df_original_combined, plot_type = "count")
plot_missing(df_original_combined, plot_type = "ratio")
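If `plot_missing` (assumed here to be a helper such as the one in dataprep's EDA module) is not available, the same count and ratio views can be computed directly with pandas — a minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_original_combined
df = pd.DataFrame({
    "Pool QC": [np.nan, "Gd", np.nan],
    "Alley": [np.nan, np.nan, "Grvl"],
})

# Missing count per column (what plot_type="count" visualizes)
missing_count = df.isnull().sum()
# Missing ratio per column (what plot_type="ratio" visualizes)
missing_ratio = df.isnull().mean()
print(missing_count)
print(missing_ratio)
```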
For some features, a missing value may simply indicate the absence of the feature itself. Let us investigate whether this is the case.
For pools, we check whether Pool Area is 0; if it is, the house has no pool.
(df_original_combined['Pool Area']==0).sum()
2916
There are 2916 such rows, matching the missing count of Pool QC. We will replace the missing values with NA.
df_original_combined['Pool QC'].fillna("NA", inplace = True)
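This check-then-fill pattern — verify that the missing quality values line up with a zero area, then encode the absence explicitly — recurs for the other amenity features below. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: quality is NaN exactly where the area is 0
df = pd.DataFrame({
    "Pool Area": [0, 512, 0],
    "Pool QC": [np.nan, "Gd", np.nan],
})

# Confirm every missing quality value corresponds to a zero area,
# then encode "no pool" explicitly as the string "NA"
assert ((df["Pool Area"] == 0) == df["Pool QC"].isnull()).all()
df["Pool QC"] = df["Pool QC"].fillna("NA")
```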
Now we perform the same analysis for other missing values
For Misc Feature, the data dictionary indicates that NA refers to No misc feature, so we fill the missing values with NA
df_original_combined['Misc Feature'].value_counts()
Shed 95 Gar2 5 Othr 4 TenC 1 Elev 1 Name: Misc Feature, dtype: int64
df_original_combined['Misc Feature'].fillna("NA", inplace = True)
For Alley, the data dictionary indicates that NA refers to No Alley Access, so we fill the missing values with NA
# Note that currently NA is null value, so we replace with NA
df_original_combined['Alley'].value_counts()
Grvl 120 Pave 78 Name: Alley, dtype: int64
df_original_combined['Alley'].fillna("NA", inplace = True)
Same for Fence
df_original_combined['Fence'].value_counts()
MnPrv 330 GdPrv 118 GdWo 112 MnWw 12 Name: Fence, dtype: int64
df_original_combined['Fence'].fillna("NA", inplace = True)
For Fireplace Qu, we likewise check that the number of houses with 0 fireplaces matches the missing count.
(df_original_combined['Fireplaces'] == 0).sum()
1422
df_original_combined['Fireplace Qu'].fillna("NA", inplace = True)
Apart from Garage Type, the garage-related features share the same number of missing values; these are probably houses with no garage.
We will first look at the 2 rows that have missing values for the other garage features but not for Garage Type.
(df_original_combined[(df_original_combined['Garage Qual'].isnull()) & (~df_original_combined['Garage Type'].isnull())])
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1712 | 2237 | 910201180 | 70 | RM | 50.0 | 9060 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | IDOTRR | Norm | Norm | 1Fam | 2Story | 5 | 6 | 1923 | 1999 | Gable | CompShg | Wd Sdng | Plywood | None | 0.0 | TA | TA | BrkTil | Gd | TA | No | ALQ | 548.0 | Unf | 0.0 | 311.0 | 859.0 | GasA | Ex | Y | SBrkr | 942 | 886 | 0 | 1828 | 0.0 | 0.0 | 2 | 0 | 3 | 1 | Gd | 6 | Typ | 0 | NA | Detchd | NaN | NaN | NaN | NaN | NaN | NaN | Y | 174 | 0 | 212 | 0 | 0 | 0 | NA | MnPrv | NA | 0 | 3 | 2007 | WD | 150909.0 | Train |
| 764 | 1357 | 903426160 | 60 | RM | 57.0 | 8094 | Pave | Grvl | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Norm | Norm | 2fmCon | 2.5Unf | 6 | 8 | 1910 | 1983 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.0 | TA | TA | PConc | TA | TA | Mn | Rec | 196.0 | Unf | 0.0 | 1046.0 | 1242.0 | GasA | Gd | Y | SBrkr | 1242 | 742 | 0 | 1984 | 0.0 | 0.0 | 2 | 0 | 5 | 1 | TA | 8 | Typ | 0 | NA | Detchd | NaN | NaN | 1.0 | 360.0 | NaN | NaN | Y | 64 | 0 | 180 | 0 | 0 | 0 | NA | MnPrv | Shed | 1000 | 9 | 2008 | WD | NaN | Test |
Both rows have a detached Garage Type.
We will drop the row from the training data.
However, we cannot drop the row from the test dataframe, so we fill Garage Yr Blt with Year Built (the two are strongly correlated, as the plot and correlation below show) and the remaining garage features with the mode among detached garages in the test set.
plt.figure(figsize = (6,6))
sns.scatterplot(data = df_original_combined, x = "Garage Yr Blt", y = "Year Built")
plt.xlim(1880, 2020)
plt.ylim(1880, 2020)
df_original_combined[["Garage Yr Blt", "Year Built"]].corr()
| Garage Yr Blt | Year Built | |
|---|---|---|
| Garage Yr Blt | 1.000000 | 0.834771 |
| Year Built | 0.834771 | 1.000000 |
# Train: Filter out the row with PID
df_original_combined = df_original_combined[~(df_original_combined['PID'] == 910201180)]
# Test: fillna with either year built or mode
df_original_combined[(df_original_combined['PID'] == 903426160)] = df_original_combined[(df_original_combined['PID'] == 903426160)].fillna({
"Garage Yr Blt": df_original_combined[(df_original_combined['PID'] == 903426160)]["Year Built"],
"Garage Finish": df_original_test[df_original_test["Garage Type"] == "Detchd"]["Garage Finish"].mode()[0],
"Garage Qual": df_original_test[df_original_test["Garage Type"] == "Detchd"]["Garage Qual"].mode()[0],
"Garage Cond": df_original_test[df_original_test["Garage Type"] == "Detchd"]["Garage Cond"].mode()[0]
})
df_original_combined[(df_original_combined['PID'] == 903426160)][["Garage Yr Blt", "Garage Finish", "Garage Qual", "Garage Cond"]]
| Garage Yr Blt | Garage Finish | Garage Qual | Garage Cond | |
|---|---|---|---|---|
| 764 | 1910.0 | Unf | TA | TA |
Now we tackle the missing garage values for houses with no garage. As in the previous case, we fill the missing values with sensible defaults, taking care to prevent leakage by restricting any statistics used to the corresponding train or test set.
# First we get the PID for the missing garage features so its more convenient to check back
garage_missing_PID = df_original_combined[df_original_combined['Garage Finish'].isnull()]['PID']
# Fill missing for other garage features
df_original_combined = df_original_combined.fillna({
"Garage Yr Blt": df_original_combined["Year Built"],
"Garage Finish": "NA",
"Garage Qual": "NA",
"Garage Cond": "NA",
"Garage Type": "NA"
})
df_original_combined[df_original_combined['PID'].isin(garage_missing_PID)].head(5)
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | 2243 | 911103060 | 190 | C (all) | 75.0 | 8250 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | IDOTRR | Feedr | Norm | 2fmCon | 2Story | 5 | 6 | 1895 | 2006 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | Gd | CBlock | TA | TA | No | Unf | 0.0 | Unf | 0.0 | 957.0 | 957.0 | GasA | Fa | N | SBrkr | 1034 | 957 | 0 | 1991 | 0.0 | 0.0 | 2 | 0 | 4 | 2 | TA | 9 | Typ | 0 | NA | NA | 1895.0 | NA | 0.0 | 0.0 | NA | NA | N | 0 | 0 | 133 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2007 | WD | 119600.0 | Train |
| 53 | 330 | 923226250 | 160 | RM | 21.0 | 1476 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | Norm | Twnhs | 2Story | 4 | 7 | 1970 | 1970 | Gable | CompShg | CemntBd | CmentBd | None | 0.0 | TA | TA | CBlock | TA | TA | No | GLQ | 176.0 | Unf | 0.0 | 370.0 | 546.0 | GasA | Ex | Y | SBrkr | 546 | 546 | 0 | 1092 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | TA | 5 | Typ | 0 | NA | NA | 1970.0 | NA | 0.0 | 0.0 | NA | NA | Y | 200 | 26 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 3 | 2010 | WD | 76000.0 | Train |
| 65 | 2278 | 923202134 | 20 | RL | 70.0 | 8402 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Feedr | Norm | 1Fam | 1Story | 5 | 5 | 2007 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | TA | PConc | Gd | TA | No | ALQ | 206.0 | Unf | 0.0 | 914.0 | 1120.0 | GasA | Ex | Y | SBrkr | 1120 | 0 | 0 | 1120 | 0.0 | 0.0 | 1 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NA | NA | 2007.0 | NA | 0.0 | 0.0 | NA | NA | Y | 0 | 30 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2007 | New | 147000.0 | Train |
| 79 | 2235 | 910201050 | 50 | RM | 50.0 | 7288 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | IDOTRR | Artery | Norm | 1Fam | 1.5Fin | 5 | 7 | 1925 | 2003 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | Gd | Gd | BrkTil | TA | Po | No | Unf | 0.0 | Unf | 0.0 | 936.0 | 936.0 | GasA | Ex | Y | SBrkr | 936 | 665 | 0 | 1601 | 0.0 | 0.0 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NA | NA | 1925.0 | NA | 0.0 | 0.0 | NA | NA | N | 0 | 0 | 0 | 176 | 0 | 0 | NA | NA | NA | 0 | 9 | 2007 | WD | 129850.0 | Train |
| 101 | 2084 | 905476170 | 30 | RL | 58.0 | 9098 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Edwards | Norm | Norm | 1Fam | 1Story | 4 | 7 | 1920 | 2002 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.0 | TA | TA | BrkTil | TA | TA | Mn | ALQ | 348.0 | Unf | 0.0 | 180.0 | 528.0 | GasA | Ex | Y | SBrkr | 605 | 0 | 0 | 605 | 1.0 | 0.0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | NA | NA | 1920.0 | NA | 0.0 | 0.0 | NA | NA | N | 0 | 0 | 144 | 0 | 0 | 0 | NA | NA | NA | 0 | 7 | 2007 | WD | 86000.0 | Train |
For the basement features, 3 observations are missing Bsmt Exposure but none of the other basement fields; let us take a closer look.
(df_original_combined[((df_original_combined['Bsmt Exposure'].isnull()) &
(~df_original_combined['Bsmt Cond'].isnull()))])
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1456 | 1797 | 528458090 | 60 | FV | 81.0 | 10411 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | Somerst | Norm | Norm | 1Fam | 2Story | 5 | 5 | 2007 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | Gd | TA | CBlock | Gd | TA | NaN | Unf | 0.0 | Unf | 0.0 | 725.0 | 725.0 | GasA | Ex | Y | SBrkr | 725 | 863 | 0 | 1588 | 0.0 | 0.0 | 3 | 0 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2007.0 | Unf | 2.0 | 561.0 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 7 | 2007 | New | 212109.0 | Train |
| 1547 | 67 | 528445060 | 20 | RL | 73.0 | 8987 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2005 | 2006 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 226.0 | Gd | TA | PConc | Gd | TA | NaN | Unf | 0.0 | Unf | 0.0 | 1595.0 | 1595.0 | GasA | Ex | Y | SBrkr | 1595 | 0 | 0 | 1595 | 0.0 | 0.0 | 2 | 0 | 2 | 1 | Gd | 6 | Typ | 1 | Gd | Attchd | 2005.0 | RFn | 3.0 | 880.0 | TA | TA | Y | 144 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2010 | WD | 221500.0 | Train |
| 1997 | 2780 | 907194130 | 60 | RL | 65.0 | 14006 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2002 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 144.0 | Gd | TA | PConc | Gd | TA | NaN | Unf | 0.0 | Unf | 0.0 | 936.0 | 936.0 | GasA | Ex | Y | SBrkr | 936 | 840 | 0 | 1776 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | TA | Attchd | 2002.0 | RFn | 2.0 | 474.0 | TA | TA | Y | 144 | 96 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | 192500.0 | Train |
Since there are only 3 such rows, all from the training set, we will drop them.
df_original_combined = (df_original_combined[~((df_original_combined['Bsmt Exposure'].isnull()) &
(~df_original_combined['Bsmt Cond'].isnull()))])
df_original_combined.shape
(2925, 82)
For the other basement features, the values are missing because the house has no basement, as seen from the 0 values for Total Bsmt SF. Let us fill the categorical features with NA and the numeric features with 0.
# A quick look to see if the basement area is 0 or NaN, which would indicate a lack of basement
df_original_combined[((df_original_combined['Bsmt Qual'].isnull()) & (df_original_combined['Total Bsmt SF'] == 0.0)) | (df_original_combined['Total Bsmt SF'].isnull())].head()
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 807 | 906226060 | 70 | RL | 120.0 | 26400 | Pave | NA | Reg | Bnk | AllPub | FR2 | Gtl | SawyerW | Feedr | Norm | 1Fam | 2Story | 5 | 7 | 1880 | 2007 | Gable | CompShg | HdBoard | HdBoard | None | 0.0 | Gd | TA | PConc | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | GasA | Ex | Y | SBrkr | 1288 | 728 | 0 | 2016 | 0.0 | 0.0 | 1 | 0 | 4 | 1 | TA | 7 | Mod | 1 | TA | Attchd | 1900.0 | Unf | 2.0 | 576.0 | TA | TA | P | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2009 | WD | 131000.0 | Train |
| 93 | 811 | 906226100 | 90 | RL | 64.0 | 7018 | Pave | NA | Reg | Bnk | AllPub | Inside | Gtl | SawyerW | Norm | Norm | Duplex | 1Story | 5 | 5 | 1979 | 1979 | Gable | CompShg | HdBoard | HdBoard | None | 0.0 | TA | TA | Slab | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | GasA | TA | Y | SBrkr | 1535 | 0 | 0 | 1535 | 0.0 | 0.0 | 2 | 0 | 4 | 2 | TA | 8 | Typ | 0 | NA | Attchd | 1979.0 | Unf | 2.0 | 400.0 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NA | GdPrv | NA | 0 | 6 | 2009 | WD | 118858.0 | Train |
| 114 | 781 | 905228050 | 20 | RL | NaN | 9000 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | Sawyer | Norm | Norm | 1Fam | 1Story | 2 | 2 | 1947 | 1950 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.0 | TA | TA | Slab | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | GasA | Fa | N | SBrkr | 660 | 0 | 0 | 660 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | Fa | 5 | Min2 | 0 | NA | NA | 1947.0 | NA | 0.0 | 0.0 | NA | NA | N | 0 | 0 | 100 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2009 | WD | 63900.0 | Train |
| 146 | 888 | 908128060 | 85 | RL | 64.0 | 7301 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | Edwards | Norm | Norm | 1Fam | SFoyer | 7 | 5 | 2003 | 2003 | Gable | CompShg | HdBoard | HdBoard | BrkFace | 500.0 | Gd | TA | Slab | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | GasA | Ex | Y | SBrkr | 495 | 1427 | 0 | 1922 | 0.0 | 0.0 | 3 | 0 | 4 | 1 | Gd | 7 | Typ | 1 | Ex | BuiltIn | 2003.0 | RFn | 2.0 | 672.0 | TA | TA | Y | 0 | 0 | 177 | 0 | 0 | 0 | NA | NA | NA | 0 | 7 | 2009 | ConLD | 198500.0 | Train |
| 183 | 1554 | 910251050 | 20 | A (agr) | 80.0 | 14584 | Pave | NA | Reg | Low | AllPub | Inside | Mod | IDOTRR | Norm | Norm | 1Fam | 1Story | 1 | 5 | 1952 | 1952 | Gable | CompShg | AsbShng | VinylSd | None | 0.0 | Fa | Po | Slab | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | Wall | Po | N | FuseA | 733 | 0 | 0 | 733 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | Fa | 4 | Sal | 0 | NA | Attchd | 1952.0 | Unf | 2.0 | 487.0 | Fa | Po | N | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | 13100.0 | Train |
# Fill missing for other basement features
df_original_combined = df_original_combined.fillna({
"Bsmt Qual": "NA",
"Bsmt Cond": "NA",
"Bsmt Exposure": "NA",
"BsmtFin Type 1": "NA",
"BsmtFin Type 2": "NA",
"BsmtFin SF 1": 0,
"BsmtFin SF 2": 0,
"Bsmt Unf SF": 0,
"Total Bsmt SF": 0,
"Bsmt Full Bath": 0,
"Bsmt Half Bath": 0
})
For Mas Vnr Type and Area, we will impute the missing values as "None" and 0.
df_original_combined['Mas Vnr Type'].unique()
array(['BrkFace', 'None', nan, 'Stone', 'BrkCmn', 'CBlock'], dtype=object)
# Fill missing for Masonry Veneer features
df_original_combined = df_original_combined.fillna({
"Mas Vnr Type": "None",
"Mas Vnr Area": 0,
})
# View the missing row for electrical
df_original_combined[df_original_combined['Electrical'].isnull()]
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 634 | 1578 | 916386080 | 80 | RL | 73.0 | 9735 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Timber | Norm | Norm | 1Fam | SLvl | 5 | 5 | 2006 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | TA | PConc | Gd | TA | No | Unf | 0.0 | Unf | 0.0 | 384.0 | 384.0 | GasA | Gd | Y | NaN | 754 | 640 | 0 | 1394 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 0 | NA | BuiltIn | 2007.0 | Fin | 2.0 | 400.0 | TA | TA | Y | 100 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2008 | WD | NaN | Test |
Since this is an observation from the test set, we cannot drop it. Hence, we will fill it with the most frequent occurrence.
df_original_combined['Electrical'] = df_original_combined['Electrical'].fillna(df_original_combined['Electrical'].mode()[0])
Lot Frontage refers to the length of the street directly bordering the property. Properties located along the same street often have similar lots and styles, and properties located close to each other will have similar PIDs (Parcel IDs). Lot Area, Lot Config, Neighborhood and dwelling type also tend to influence the frontage. So we will fit a KNN model on these features to predict the missing values for Lot Frontage.
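As a quick illustration of how `weights='distance'` biases a KNN prediction toward the closest neighbors — a toy sketch, not the actual imputation below:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Three training points on a line; predict near x = 0.9
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 10.0, 20.0])

knn = KNeighborsRegressor(n_neighbors=2, weights="distance")
knn.fit(X, y)

# Neighbors of 0.9 are x=1 (distance 0.1) and x=0 (distance 0.9);
# inverse-distance weighting pulls the prediction strongly toward y=10
pred = knn.predict([[0.9]])[0]
print(pred)  # 9.0
```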
# See if there is some relation between the features and lot frontage
plt.figure(figsize = (20, 12))
sns.scatterplot(data = df_original_combined, x = 'PID', y = "Lot Frontage", hue = 'Neighborhood', alpha = 0.5, palette = 'colorblind', size = "Lot Config", sizes = (10, 200), size_order = ["CulDSac", "FR2", "FR3", "Inside", "Corner"])
plt.xticks(rotation = 90);
plt.legend(loc = 'best')
plt.xlim(900000000, 920000000)
plt.ylim(20, 160);
There seems to be some clustering among the observations, so we will proceed with predictions based on this logic.
First, we pick out the PIDs of all rows with missing Lot Frontage values so it is easier to merge the predictions back later.
# Get PID for rows with missing values
PID_fontage_missing = df_original_combined[df_original_combined['Lot Frontage'].isnull()]['PID']
Then we select the columns for the relevant features.
# Get a subset of the features we want
df_original_combined_subset = df_original_combined[["PID", "MS SubClass", "Lot Frontage", "Lot Area", "Lot Config", "Neighborhood", "Train/Test"]]
# Get dummy for categorical variables
df_original_combined_subset_dummies = pd.get_dummies(data = df_original_combined_subset,
columns = ["MS SubClass", "Lot Config", "Neighborhood"], drop_first=True)
# Divide into train and test to prevent data leakage
df_original_combined_subset_dummies_train = df_original_combined_subset_dummies[df_original_combined_subset_dummies["Train/Test"] == "Train"].drop("Train/Test", axis = 1)
df_original_combined_subset_dummies_test = df_original_combined_subset_dummies[df_original_combined_subset_dummies["Train/Test"] == "Test"].drop("Train/Test", axis = 1)
df_original_combined_subset_dummies_train.head(3)
| PID | Lot Frontage | Lot Area | MS SubClass_30 | MS SubClass_40 | MS SubClass_45 | MS SubClass_50 | MS SubClass_60 | MS SubClass_70 | MS SubClass_75 | MS SubClass_80 | MS SubClass_85 | MS SubClass_90 | MS SubClass_120 | MS SubClass_150 | MS SubClass_160 | MS SubClass_180 | MS SubClass_190 | Lot Config_CulDSac | Lot Config_FR2 | Lot Config_FR3 | Lot Config_Inside | Neighborhood_Blueste | Neighborhood_BrDale | Neighborhood_BrkSide | Neighborhood_ClearCr | Neighborhood_CollgCr | Neighborhood_Crawfor | Neighborhood_Edwards | Neighborhood_Gilbert | Neighborhood_Greens | Neighborhood_GrnHill | Neighborhood_IDOTRR | Neighborhood_Landmrk | Neighborhood_MeadowV | Neighborhood_Mitchel | Neighborhood_NAmes | Neighborhood_NPkVill | Neighborhood_NWAmes | Neighborhood_NoRidge | Neighborhood_NridgHt | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 533352170 | NaN | 13517 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 531379050 | 43.0 | 11492 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 535304180 | 68.0 | 7922 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# We use a KNN regressor to predict Lot Frontage,
# using the nearest 3 neighbors (so there is a higher chance the properties are located close together)
# with distance weighting, so the properties nearest to the property of interest have a higher influence on the prediction
# Train
it_imp = IterativeImputer(estimator = KNeighborsRegressor(n_neighbors=3, weights = 'distance'), random_state=0)
it_impt_train = it_imp.fit_transform(df_original_combined_subset_dummies_train)
df_lot_subset_train_predicted = pd.DataFrame(it_impt_train, columns = df_original_combined_subset_dummies_train.columns)
# Test
it_imp = IterativeImputer(estimator = KNeighborsRegressor(n_neighbors=3, weights = 'distance'), random_state=0)
it_impt_test = it_imp.fit_transform(df_original_combined_subset_dummies_test)
df_lot_subset_test_predicted = pd.DataFrame(it_impt_test, columns = df_original_combined_subset_dummies_test.columns)
# Combine train and test
df_lot_subset_predicted = pd.concat([df_lot_subset_train_predicted, df_lot_subset_test_predicted])
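One caveat: assigning via `.values` below relies on the concatenated train/test frames preserving the original row order. Re-assigning by index label instead is order-independent — a hedged sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with a missing Lot Frontage at index label 11
original = pd.DataFrame({"Lot Frontage": [60.0, None, 80.0]}, index=[10, 11, 12])

# Imputed values carry the original index labels, so .loc assignment
# aligns by label even if the rows arrive in a different order
imputed = pd.Series([70.0], index=[11])
original.loc[imputed.index, "Lot Frontage"] = imputed
print(original["Lot Frontage"].tolist())  # [60.0, 70.0, 80.0]
```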
# Replace with predicted values
df_original_combined_no_missing = df_original_combined.copy()
df_original_combined_no_missing["Lot Frontage"] = df_lot_subset_predicted["Lot Frontage"].values
# Plot the distribution before and after using imputer
plt.figure(figsize = (20, 10))
ax1 = plt.subplot(211)
sns.histplot(df_original_combined["Lot Frontage"], stat = 'density', bins = 100)
plt.xlim(20, 150)
plt.ylim(0, 0.05)
plt.title("Original Distribution")
ax2 = plt.subplot(212)
sns.histplot(df_original_combined_no_missing["Lot Frontage"], stat = 'density', bins = 100, color = "teal")
plt.xlim(20, 150)
plt.ylim(0, 0.05)
plt.title("Distribution after KNN Imputer")
plt.tight_layout();
We see that the distributions are almost identical before and after imputing with the KNN regressor.
We have now dealt with all missing values.
Let us view the current dataframe.
print(df_original_combined_no_missing.shape)
df_original_combined_no_missing
(2925, 82)
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | RL | 97.621342 | 13517 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | Sawyer | RRAe | Norm | 1Fam | 2Story | 6 | 8 | 1976 | 2005 | Gable | CompShg | HdBoard | Plywood | BrkFace | 289.0 | Gd | TA | CBlock | TA | TA | No | GLQ | 533.0 | Unf | 0.0 | 192.0 | 725.0 | GasA | Ex | Y | SBrkr | 725 | 754 | 0 | 1479 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | NA | Attchd | 1976.0 | RFn | 2.0 | 475.0 | TA | TA | Y | 0 | 44 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 3 | 2010 | WD | 130500.0 | Train |
| 1 | 544 | 531379050 | 60 | RL | 43.000000 | 11492 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | SawyerW | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1996 | 1997 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 132.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 637.0 | Unf | 0.0 | 276.0 | 913.0 | GasA | Ex | Y | SBrkr | 913 | 1209 | 0 | 2122 | 1.0 | 0.0 | 2 | 1 | 4 | 1 | Gd | 8 | Typ | 1 | TA | Attchd | 1997.0 | RFn | 2.0 | 559.0 | TA | TA | Y | 0 | 74 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2009 | WD | 220000.0 | Train |
| 2 | 153 | 535304180 | 20 | RL | 68.000000 | 7922 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1953 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | Gd | CBlock | TA | TA | No | GLQ | 731.0 | Unf | 0.0 | 326.0 | 1057.0 | GasA | TA | Y | SBrkr | 1057 | 0 | 0 | 1057 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | Gd | 5 | Typ | 0 | NA | Detchd | 1953.0 | Unf | 1.0 | 246.0 | TA | TA | Y | 0 | 52 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 1 | 2010 | WD | 109000.0 | Train |
| 3 | 318 | 916386060 | 60 | RL | 73.000000 | 9802 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Timber | Norm | Norm | 1Fam | 2Story | 5 | 5 | 2006 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | TA | PConc | Gd | TA | No | Unf | 0.0 | Unf | 0.0 | 384.0 | 384.0 | GasA | Gd | Y | SBrkr | 744 | 700 | 0 | 1444 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 0 | NA | BuiltIn | 2007.0 | Fin | 2.0 | 400.0 | TA | TA | Y | 100 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2010 | WD | 174000.0 | Train |
| 4 | 255 | 906425045 | 50 | RL | 82.000000 | 14235 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | SawyerW | Norm | Norm | 1Fam | 1.5Fin | 6 | 8 | 1900 | 1993 | Gable | CompShg | Wd Sdng | Plywood | None | 0.0 | TA | TA | PConc | Fa | Gd | No | Unf | 0.0 | Unf | 0.0 | 676.0 | 676.0 | GasA | TA | Y | SBrkr | 831 | 614 | 0 | 1445 | 0.0 | 0.0 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NA | Detchd | 1957.0 | Unf | 2.0 | 484.0 | TA | TA | N | 0 | 59 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 3 | 2010 | WD | 138500.0 | Train |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 536 | 2919 | 923228210 | 160 | RM | 21.000000 | 1526 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | Norm | Twnhs | 2Story | 4 | 5 | 1970 | 1970 | Gable | CompShg | CemntBd | CmentBd | None | 0.0 | TA | TA | CBlock | TA | TA | No | Unf | 0.0 | Unf | 0.0 | 546.0 | 546.0 | GasA | TA | Y | SBrkr | 546 | 546 | 0 | 1092 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | TA | 5 | Typ | 0 | NA | NA | 1970.0 | NA | 0.0 | 0.0 | NA | NA | Y | 0 | 34 | 0 | 0 | 0 | 0 | NA | GdPrv | NA | 0 | 6 | 2006 | WD | NaN | Test |
| 178 | 2921 | 923228310 | 160 | RM | 21.000000 | 1894 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | Norm | TwnhsE | 2Story | 4 | 5 | 1970 | 1970 | Gable | CompShg | CemntBd | CmentBd | None | 0.0 | TA | TA | CBlock | TA | TA | No | Rec | 252.0 | Unf | 0.0 | 294.0 | 546.0 | GasA | TA | Y | SBrkr | 546 | 546 | 0 | 1092 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | TA | 6 | Typ | 0 | NA | CarPort | 1970.0 | Unf | 1.0 | 286.0 | TA | TA | Y | 0 | 24 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2006 | WD | NaN | Test |
| 828 | 2922 | 923229110 | 90 | RL | 55.000000 | 12640 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | Duplex | 1Story | 6 | 5 | 1976 | 1976 | Gable | CompShg | Plywood | Plywood | None | 0.0 | TA | TA | CBlock | TA | TA | Gd | Rec | 936.0 | LwQ | 396.0 | 396.0 | 1728.0 | GasA | TA | Y | SBrkr | 1728 | 0 | 0 | 1728 | 0.0 | 0.0 | 2 | 0 | 4 | 2 | TA | 8 | Typ | 0 | NA | Attchd | 1976.0 | Unf | 2.0 | 574.0 | TA | TA | Y | 40 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 7 | 2006 | WD | NaN | Test |
| 668 | 2925 | 923251180 | 20 | RL | 160.000000 | 20000 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1960 | 1996 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | TA | CBlock | TA | TA | No | ALQ | 1224.0 | Unf | 0.0 | 0.0 | 1224.0 | GasA | Ex | Y | SBrkr | 1224 | 0 | 0 | 1224 | 1.0 | 0.0 | 1 | 0 | 4 | 1 | TA | 7 | Typ | 1 | TA | Detchd | 1960.0 | Unf | 2.0 | 576.0 | TA | TA | Y | 474 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2006 | WD | NaN | Test |
| 674 | 2928 | 923400125 | 85 | RL | 62.000000 | 10441 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | SFoyer | 5 | 5 | 1992 | 1992 | Gable | CompShg | HdBoard | Wd Shng | None | 0.0 | TA | TA | PConc | Gd | TA | Av | GLQ | 337.0 | Unf | 0.0 | 575.0 | 912.0 | GasA | TA | Y | SBrkr | 970 | 0 | 0 | 970 | 0.0 | 1.0 | 1 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NA | NA | 1992.0 | NA | 0.0 | 0.0 | NA | NA | Y | 80 | 32 | 0 | 0 | 0 | 0 | NA | MnPrv | Shed | 700 | 7 | 2006 | WD | NaN | Test |
2925 rows × 82 columns
Some features consist of values that indicate some form of order; these are mainly features that place a judgement on condition or quality. There is also the Central Air feature, with two values indicating the presence or absence of central air.
encode_features_cols = ['Overall Qual','Overall Cond','Exter Qual','Exter Cond','Bsmt Qual','Bsmt Cond',
'Heating QC','Kitchen Qual','Garage Qual','Garage Cond','Pool QC','Fireplace Qu',
'Central Air', 'Functional']
# View unique values
for col in encode_features_cols:
    print(col, ": ", df_original_combined_no_missing[col].unique())
Overall Qual : [ 6 7 5 8 10 4 9 3 2 1] Overall Cond : [8 5 7 6 3 9 2 4 1] Exter Qual : ['Gd' 'TA' 'Ex' 'Fa'] Exter Cond : ['TA' 'Gd' 'Fa' 'Ex' 'Po'] Bsmt Qual : ['TA' 'Gd' 'Fa' 'NA' 'Ex' 'Po'] Bsmt Cond : ['TA' 'Gd' 'NA' 'Fa' 'Po' 'Ex'] Heating QC : ['Ex' 'TA' 'Gd' 'Fa' 'Po'] Kitchen Qual : ['Gd' 'TA' 'Fa' 'Ex' 'Po'] Garage Qual : ['TA' 'Fa' 'NA' 'Gd' 'Ex' 'Po'] Garage Cond : ['TA' 'Fa' 'NA' 'Po' 'Gd' 'Ex'] Pool QC : ['NA' 'Fa' 'Gd' 'Ex' 'TA'] Fireplace Qu : ['NA' 'TA' 'Gd' 'Po' 'Ex' 'Fa'] Central Air : ['Y' 'N'] Functional : ['Typ' 'Mod' 'Min2' 'Maj1' 'Min1' 'Sev' 'Sal' 'Maj2']
df_original_combined_no_missing.replace({
"Exter Qual":{'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Exter Cond":{'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Bsmt Qual":{'NA':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Bsmt Cond":{'NA':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Heating QC":{'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Kitchen Qual":{'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Garage Qual":{'NA':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Garage Cond":{'NA':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Pool QC":{'NA':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Fireplace Qu":{'NA':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
"Central Air": {"N": 0,"Y": 1},
"Functional": {'Sal':1,'Sev':2,'Maj2':3,'Maj1':4,'Mod':5,'Min2':6,'Min1':7,'Typ':8}
},inplace=True)
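As a minimal illustration of this encoding step (on toy data, not the notebook's dataframe), `DataFrame.replace` accepts a dict of per-column dicts and maps each quality code to its ordinal value:

```python
import pandas as pd

# Toy frame using the same quality codes as the Ames data (Po..Ex, NA = absent)
df = pd.DataFrame({
    "Kitchen Qual": ["Gd", "TA", "Ex"],
    "Bsmt Qual":    ["NA", "Fa", "Gd"],
})

quality_map = {"NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df = df.replace({"Kitchen Qual": quality_map, "Bsmt Qual": quality_map})

print(df["Kitchen Qual"].tolist())  # [4, 3, 5]
print(df["Bsmt Qual"].tolist())     # [0, 2, 4]
```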
Instead of using Year Built and Year Remod/Add directly, we could derive Age Since Built and Age Since Remod. Let us first see if the year features have any anomalies.
# The Ames dataset was taken in 2011, so any date later than that would not make sense
# Year Sold
print("Yr Sold:", df_original_combined_no_missing[df_original_combined_no_missing['Yr Sold']>2011].shape)
print("Yr Sold - Earliest: ", df_original_combined_no_missing['Yr Sold'].min())
# Year Built
print("Year Built:", df_original_combined_no_missing[df_original_combined_no_missing['Year Built']>2011].shape)
print("Year Built - Earliest: ", df_original_combined_no_missing['Year Built'].min())
# Year Remod/Add
print("Year Remod/Add:", df_original_combined_no_missing[df_original_combined_no_missing['Year Remod/Add']>2011].shape)
print("Year Remod/Add - Earliest: ", df_original_combined_no_missing['Year Remod/Add'].min())
# Garage Yr Blt
print("Garage Yr Blt:", df_original_combined_no_missing[df_original_combined_no_missing['Garage Yr Blt']>2011].shape)
print("Garage Yr Blt - Earliest: ", df_original_combined_no_missing['Garage Yr Blt'].min())
Yr Sold: (0, 82) Yr Sold - Earliest: 2006 Year Built: (0, 82) Year Built - Earliest: 1872 Year Remod/Add: (0, 82) Year Remod/Add - Earliest: 1950 Garage Yr Blt: (1, 82) Garage Yr Blt - Earliest: 1872.0
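The per-column checks above can also be written as one small loop. A sketch on hypothetical toy rows (the 2011 collection date is taken from the dataset documentation):

```python
import pandas as pd

# Toy rows: one garage year is clearly a typo (2207 > collection year 2011)
df = pd.DataFrame({
    "Year Built":    [1976, 2006, 1900],
    "Garage Yr Blt": [1976.0, 2207.0, 1957.0],
})

# Flag row indices in any year column with values after the 2011 collection date
year_cols = ["Year Built", "Garage Yr Blt"]
anomalies = {col: df.index[df[col] > 2011].tolist() for col in year_cols}
print(anomalies)  # {'Year Built': [], 'Garage Yr Blt': [1]}
```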
# We have one instance of Garage Yr Blt later than 2011
df_original_combined_no_missing[df_original_combined_no_missing['Garage Yr Blt']>2011]
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1699 | 2261 | 916384070 | 20 | RL | 68.0 | 8298 | Pave | NA | IR1 | HLS | AllPub | Inside | Gtl | Timber | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2006 | 2007 | Hip | CompShg | VinylSd | VinylSd | None | 0.0 | 4 | 3 | PConc | 4 | 3 | Av | GLQ | 583.0 | Unf | 0.0 | 963.0 | 1546.0 | GasA | 5 | 1 | SBrkr | 1564 | 0 | 0 | 1564 | 0.0 | 0.0 | 2 | 0 | 2 | 1 | 5 | 6 | 8 | 1 | 4 | Attchd | 2207.0 | RFn | 2.0 | 502.0 | 3 | 3 | Y | 132 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 9 | 2007 | New | 267300.0 | Train |
# We will replace the Garage Yr Blt with the Year Built
df_original_combined_no_missing = df_original_combined_no_missing.replace({"Garage Yr Blt": {2207.0:2006}})
df_original_combined_no_missing[df_original_combined_no_missing['Garage Yr Blt']>2011]
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test |
|---|
We have now dealt with the Garage Yr Blt issue. Let us see if there are any remodel years earlier than the built year, which would not make sense either.
df_original_combined_no_missing[df_original_combined_no_missing["Year Remod/Add"] < df_original_combined_no_missing["Year Built"]]
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1003 | 851 | 907194160 | 20 | RL | 65.0 | 10739 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 1Story | 7 | 5 | 2002 | 2001 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 68.0 | 4 | 3 | PConc | 4 | 3 | No | GLQ | 1259.0 | Unf | 0.0 | 172.0 | 1431.0 | GasA | 5 | 1 | SBrkr | 1444 | 0 | 0 | 1444 | 1.0 | 0.0 | 2 | 0 | 3 | 1 | 4 | 6 | 8 | 0 | 0 | Attchd | 2002.0 | RFn | 2.0 | 577.0 | 3 | 3 | Y | 144 | 40 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 4 | 2009 | WD | 203000.0 | Train |
There is one such case; we will change its Year Built to 2001.
df_original_combined_no_missing[df_original_combined_no_missing["PID"] == 907194160] = df_original_combined_no_missing[df_original_combined_no_missing["PID"] == 907194160].replace({"Year Built":{2002:2001}})
df_original_combined_no_missing[df_original_combined_no_missing["PID"] == 907194160]
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1003 | 851 | 907194160 | 20 | RL | 65.0 | 10739 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 1Story | 7 | 5 | 2001 | 2001 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 68.0 | 4 | 3 | PConc | 4 | 3 | No | GLQ | 1259.0 | Unf | 0.0 | 172.0 | 1431.0 | GasA | 5 | 1 | SBrkr | 1444 | 0 | 0 | 1444 | 1.0 | 0.0 | 2 | 0 | 3 | 1 | 4 | 6 | 8 | 0 | 0 | Attchd | 2002.0 | RFn | 2.0 | 577.0 | 3 | 3 | Y | 144 | 40 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 4 | 2009 | WD | 203000.0 | Train |
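The value-based replace above works, but `.loc` with a boolean mask targets the offending cell directly and avoids the boolean-indexed reassignment. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

df = pd.DataFrame({
    "PID":            [907194160, 916384070],
    "Year Built":     [2002, 2006],
    "Year Remod/Add": [2001, 2007],
})

# Overwrite the anomalous Year Built with the remodel year for that one PID
mask = df["PID"] == 907194160
df.loc[mask, "Year Built"] = df.loc[mask, "Year Remod/Add"]
print(df["Year Built"].tolist())  # [2001, 2006]
```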
We will now create the new age feature columns.
df_original_combined_no_missing["Age Since Built"] = df_original_combined_no_missing["Yr Sold"] - df_original_combined_no_missing["Year Built"]
df_original_combined_no_missing["Age Since Remod/Add"] = df_original_combined_no_missing["Yr Sold"] - df_original_combined_no_missing["Year Remod/Add"]
df_original_combined_no_missing.head(3)
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | Age Since Built | Age Since Remod/Add | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | RL | 97.621342 | 13517 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | Sawyer | RRAe | Norm | 1Fam | 2Story | 6 | 8 | 1976 | 2005 | Gable | CompShg | HdBoard | Plywood | BrkFace | 289.0 | 4 | 3 | CBlock | 3 | 3 | No | GLQ | 533.0 | Unf | 0.0 | 192.0 | 725.0 | GasA | 5 | 1 | SBrkr | 725 | 754 | 0 | 1479 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 4 | 6 | 8 | 0 | 0 | Attchd | 1976.0 | RFn | 2.0 | 475.0 | 3 | 3 | Y | 0 | 44 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 3 | 2010 | WD | 130500.0 | Train | 34 | 5 |
| 1 | 544 | 531379050 | 60 | RL | 43.000000 | 11492 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | SawyerW | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1996 | 1997 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 132.0 | 4 | 3 | PConc | 4 | 3 | No | GLQ | 637.0 | Unf | 0.0 | 276.0 | 913.0 | GasA | 5 | 1 | SBrkr | 913 | 1209 | 0 | 2122 | 1.0 | 0.0 | 2 | 1 | 4 | 1 | 4 | 8 | 8 | 1 | 3 | Attchd | 1997.0 | RFn | 2.0 | 559.0 | 3 | 3 | Y | 0 | 74 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 4 | 2009 | WD | 220000.0 | Train | 13 | 12 |
| 2 | 153 | 535304180 | 20 | RL | 68.000000 | 7922 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1953 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | 3 | 4 | CBlock | 3 | 3 | No | GLQ | 731.0 | Unf | 0.0 | 326.0 | 1057.0 | GasA | 3 | 1 | SBrkr | 1057 | 0 | 0 | 1057 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | 4 | 5 | 8 | 0 | 0 | Detchd | 1953.0 | Unf | 1.0 | 246.0 | 3 | 3 | Y | 0 | 52 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 1 | 2010 | WD | 109000.0 | Train | 57 | 3 |
We shall do some visualization of the year and age features against SalePrice.
df_original_train_no_missing = df_original_combined_no_missing[df_original_combined_no_missing["Train/Test"] == "Train"]
sns.pairplot(df_original_train_no_missing[["Year Built", "Year Remod/Add", "Yr Sold", "Age Since Built", "Age Since Remod/Add", "SalePrice"]])
<seaborn.axisgrid.PairGrid at 0x198e1b54100>
df_original_train_no_missing[["Year Built", "Year Remod/Add", "Yr Sold", "Age Since Built", "Age Since Remod/Add", "SalePrice"]].corr()
| Year Built | Year Remod/Add | Yr Sold | Age Since Built | Age Since Remod/Add | SalePrice | |
|---|---|---|---|---|---|---|
| Year Built | 1.000000 | 0.629560 | -0.003963 | -0.999056 | -0.630274 | 0.571934 |
| Year Remod/Add | 0.629560 | 1.000000 | 0.043056 | -0.626987 | -0.998056 | 0.550527 |
| Yr Sold | -0.003963 | 0.043056 | 1.000000 | 0.047395 | 0.019298 | -0.015551 |
| Age Since Built | -0.999056 | -0.626987 | 0.047395 | 1.000000 | 0.630409 | -0.571972 |
| Age Since Remod/Add | -0.630274 | -0.998056 | 0.019298 | 0.630409 | 1.000000 | -0.551906 |
| SalePrice | 0.571934 | 0.550527 | -0.015551 | -0.571972 | -0.551906 | 1.000000 |
We see that the Age features have slightly stronger correlations with SalePrice than the Year features. They also have the advantage of being derived from both the sold and built/remodel years. Therefore, we will drop Year Built and Year Remod/Add. However, we will keep Yr Sold, as it may account for yearly sale price fluctuations; even from the visualization, we can see a price drop in 2008, which may reflect the 2008 recession.
print(df_original_combined_no_missing.shape)
#df_original_combined_no_missing.drop(["Year Built", "Year Remod/Add"], axis = 1, inplace = True)
# We will create a list which holds the features to drop
features_to_drop = ["Year Built", "Year Remod/Add"]
(2925, 84)
df_original_combined_no_missing.shape
(2925, 84)
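The near-perfect negative correlation (about -0.999) between Year Built and Age Since Built seen in the table above is expected by construction: Age Since Built = Yr Sold - Year Built, and Yr Sold spans only 2006-2010. A quick check on synthetic data (hypothetical years, not the Ames rows):

```python
import numpy as np

rng = np.random.default_rng(0)
year_built = rng.integers(1900, 2010, size=500)
yr_sold = rng.integers(2006, 2011, size=500)  # sales span only 2006-2010, as in Ames
age = yr_sold - year_built

# Yr Sold varies so little that Age is almost a mirror image of Year Built
r = np.corrcoef(year_built, age)[0, 1]
print(round(r, 3))  # close to -1
```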
We could combine the numbers of bathrooms to produce a feature called Total Bath.
df_original_combined_no_missing['Total Bath'] = (df_original_combined_no_missing['Bsmt Full Bath'] +
df_original_combined_no_missing['Bsmt Half Bath'] +
df_original_combined_no_missing['Full Bath']+
df_original_combined_no_missing['Half Bath'])
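One common variant (not used in this notebook) weights half baths at 0.5 instead of counting every bathroom equally. A sketch on hypothetical toy rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Bsmt Full Bath": [0.0, 1.0],
    "Bsmt Half Bath": [0.0, 0.0],
    "Full Bath":      [2, 2],
    "Half Bath":      [1, 1],
})

# Straight count, as above
bath_cols = ["Bsmt Full Bath", "Bsmt Half Bath", "Full Bath", "Half Bath"]
df["Total Bath"] = df[bath_cols].sum(axis=1)

# Weighted variant: half baths count as 0.5
df["Total Bath Wt"] = (df["Bsmt Full Bath"] + df["Full Bath"]
                       + 0.5 * (df["Bsmt Half Bath"] + df["Half Bath"]))
print(df["Total Bath"].tolist())     # [3.0, 4.0]
print(df["Total Bath Wt"].tolist())  # [2.5, 3.5]
```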
We will also add a Gross Flr Area feature, which is the sum of Total Bsmt SF and Gr Liv Area.
df_original_combined_no_missing['Gross Flr Area'] = (df_original_combined_no_missing['Total Bsmt SF'] +
df_original_combined_no_missing['Gr Liv Area'])
We will add some additional boolean features to indicate the presence (1) or absence (0) of certain housing components, like a fireplace or pool.
df_original_combined_no_missing.head()
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | Age Since Built | Age Since Remod/Add | Total Bath | Gross Flr Area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | RL | 97.621342 | 13517 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | Sawyer | RRAe | Norm | 1Fam | 2Story | 6 | 8 | 1976 | 2005 | Gable | CompShg | HdBoard | Plywood | BrkFace | 289.0 | 4 | 3 | CBlock | 3 | 3 | No | GLQ | 533.0 | Unf | 0.0 | 192.0 | 725.0 | GasA | 5 | 1 | SBrkr | 725 | 754 | 0 | 1479 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 4 | 6 | 8 | 0 | 0 | Attchd | 1976.0 | RFn | 2.0 | 475.0 | 3 | 3 | Y | 0 | 44 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 3 | 2010 | WD | 130500.0 | Train | 34 | 5 | 3.0 | 2204.0 |
| 1 | 544 | 531379050 | 60 | RL | 43.000000 | 11492 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | SawyerW | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1996 | 1997 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 132.0 | 4 | 3 | PConc | 4 | 3 | No | GLQ | 637.0 | Unf | 0.0 | 276.0 | 913.0 | GasA | 5 | 1 | SBrkr | 913 | 1209 | 0 | 2122 | 1.0 | 0.0 | 2 | 1 | 4 | 1 | 4 | 8 | 8 | 1 | 3 | Attchd | 1997.0 | RFn | 2.0 | 559.0 | 3 | 3 | Y | 0 | 74 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 4 | 2009 | WD | 220000.0 | Train | 13 | 12 | 4.0 | 3035.0 |
| 2 | 153 | 535304180 | 20 | RL | 68.000000 | 7922 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1953 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | 3 | 4 | CBlock | 3 | 3 | No | GLQ | 731.0 | Unf | 0.0 | 326.0 | 1057.0 | GasA | 3 | 1 | SBrkr | 1057 | 0 | 0 | 1057 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | 4 | 5 | 8 | 0 | 0 | Detchd | 1953.0 | Unf | 1.0 | 246.0 | 3 | 3 | Y | 0 | 52 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 1 | 2010 | WD | 109000.0 | Train | 57 | 3 | 2.0 | 2114.0 |
| 3 | 318 | 916386060 | 60 | RL | 73.000000 | 9802 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Timber | Norm | Norm | 1Fam | 2Story | 5 | 5 | 2006 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | 3 | 3 | PConc | 4 | 3 | No | Unf | 0.0 | Unf | 0.0 | 384.0 | 384.0 | GasA | 4 | 1 | SBrkr | 744 | 700 | 0 | 1444 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 3 | 7 | 8 | 0 | 0 | BuiltIn | 2007.0 | Fin | 2.0 | 400.0 | 3 | 3 | Y | 100 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 4 | 2010 | WD | 174000.0 | Train | 4 | 3 | 3.0 | 1828.0 |
| 4 | 255 | 906425045 | 50 | RL | 82.000000 | 14235 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | SawyerW | Norm | Norm | 1Fam | 1.5Fin | 6 | 8 | 1900 | 1993 | Gable | CompShg | Wd Sdng | Plywood | None | 0.0 | 3 | 3 | PConc | 2 | 4 | No | Unf | 0.0 | Unf | 0.0 | 676.0 | 676.0 | GasA | 3 | 1 | SBrkr | 831 | 614 | 0 | 1445 | 0.0 | 0.0 | 2 | 0 | 3 | 1 | 3 | 6 | 8 | 0 | 0 | Detchd | 1957.0 | Unf | 2.0 | 484.0 | 3 | 3 | N | 0 | 59 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 3 | 2010 | WD | 138500.0 | Train | 110 | 17 | 2.0 | 2121.0 |
# Function to indicate if a housing component is present
def feature_presence(row):
    housing_components = {"Mas Vnr Type": "Has Mas Vnr", "Bsmt Exposure": "Has Basement",
                          "Garage Type": "Has Garage", "Wood Deck SF": "Has Wood Deck",
                          "Open Porch SF": "Has Open Porch", "3Ssn Porch": "Has 3Ssn Porch",
                          "Screen Porch": "Has Screen Porch", "Pool Area": "Has Pool",
                          "2nd Flr SF": "Has 2nd Flr"}
    for key in housing_components:
        # 'NA', 'None' or 0 in the indicator column means the component is absent
        if row[key] == "NA" or row[key] == "None" or row[key] == 0:
            row[housing_components[key]] = 0
        else:
            row[housing_components[key]] = 1
    return row
df_original_combined_no_missing = df_original_combined_no_missing.apply(feature_presence, axis = 1)
print(df_original_combined_no_missing.shape)
df_original_combined_no_missing.head()
(2925, 95)
| Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | Train/Test | Age Since Built | Age Since Remod/Add | Total Bath | Gross Flr Area | Has Mas Vnr | Has Basement | Has Garage | Has Wood Deck | Has Open Porch | Has 3Ssn Porch | Has Screen Porch | Has Pool | Has 2nd Flr | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | RL | 97.621342 | 13517 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | Sawyer | RRAe | Norm | 1Fam | 2Story | 6 | 8 | 1976 | 2005 | Gable | CompShg | HdBoard | Plywood | BrkFace | 289.0 | 4 | 3 | CBlock | 3 | 3 | No | GLQ | 533.0 | Unf | 0.0 | 192.0 | 725.0 | GasA | 5 | 1 | SBrkr | 725 | 754 | 0 | 1479 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 4 | 6 | 8 | 0 | 0 | Attchd | 1976.0 | RFn | 2.0 | 475.0 | 3 | 3 | Y | 0 | 44 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 3 | 2010 | WD | 130500.0 | Train | 34 | 5 | 3.0 | 2204.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 544 | 531379050 | 60 | RL | 43.000000 | 11492 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | SawyerW | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1996 | 1997 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 132.0 | 4 | 3 | PConc | 4 | 3 | No | GLQ | 637.0 | Unf | 0.0 | 276.0 | 913.0 | GasA | 5 | 1 | SBrkr | 913 | 1209 | 0 | 2122 | 1.0 | 0.0 | 2 | 1 | 4 | 1 | 4 | 8 | 8 | 1 | 3 | Attchd | 1997.0 | RFn | 2.0 | 559.0 | 3 | 3 | Y | 0 | 74 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 4 | 2009 | WD | 220000.0 | Train | 13 | 12 | 4.0 | 3035.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 153 | 535304180 | 20 | RL | 68.000000 | 7922 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 7 | 1953 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | 3 | 4 | CBlock | 3 | 3 | No | GLQ | 731.0 | Unf | 0.0 | 326.0 | 1057.0 | GasA | 3 | 1 | SBrkr | 1057 | 0 | 0 | 1057 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | 4 | 5 | 8 | 0 | 0 | Detchd | 1953.0 | Unf | 1.0 | 246.0 | 3 | 3 | Y | 0 | 52 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 1 | 2010 | WD | 109000.0 | Train | 57 | 3 | 2.0 | 2114.0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 318 | 916386060 | 60 | RL | 73.000000 | 9802 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Timber | Norm | Norm | 1Fam | 2Story | 5 | 5 | 2006 | 2007 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | 3 | 3 | PConc | 4 | 3 | No | Unf | 0.0 | Unf | 0.0 | 384.0 | 384.0 | GasA | 4 | 1 | SBrkr | 744 | 700 | 0 | 1444 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 3 | 7 | 8 | 0 | 0 | BuiltIn | 2007.0 | Fin | 2.0 | 400.0 | 3 | 3 | Y | 100 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 4 | 2010 | WD | 174000.0 | Train | 4 | 3 | 3.0 | 1828.0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 255 | 906425045 | 50 | RL | 82.000000 | 14235 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | SawyerW | Norm | Norm | 1Fam | 1.5Fin | 6 | 8 | 1900 | 1993 | Gable | CompShg | Wd Sdng | Plywood | None | 0.0 | 3 | 3 | PConc | 2 | 4 | No | Unf | 0.0 | Unf | 0.0 | 676.0 | 676.0 | GasA | 3 | 1 | SBrkr | 831 | 614 | 0 | 1445 | 0.0 | 0.0 | 2 | 0 | 3 | 1 | 3 | 6 | 8 | 0 | 0 | Detchd | 1957.0 | Unf | 2.0 | 484.0 | 3 | 3 | N | 0 | 59 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 3 | 2010 | WD | 138500.0 | Train | 110 | 17 | 2.0 | 2121.0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
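The row-wise `apply` above is easy to read but iterates in Python; the same flags can be computed column-wise with vectorized `isin`. A sketch on hypothetical toy data covering two of the components:

```python
import pandas as pd

df = pd.DataFrame({
    "Pool Area":   [0, 512, 0],
    "Garage Type": ["Attchd", "NA", "Detchd"],
})

# A value of 'NA', 'None' or 0 means the component is absent
absent = {"NA", "None", 0}
df["Has Pool"]   = (~df["Pool Area"].isin(absent)).astype(int)
df["Has Garage"] = (~df["Garage Type"].isin(absent)).astype(int)
print(df["Has Pool"].tolist())    # [0, 1, 0]
print(df["Has Garage"].tolist())  # [1, 0, 1]
```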
We will separate the features into Numeric, Ordinal, Categorical and Non-Useful (such as Id) features and use different methods for processing and selection.
# Non-useful predictor
not_useful_features = ["Id", "Train/Test"]
# Categorical predictors
categorical_features = list(df_original_combined_no_missing.select_dtypes(include = "object").columns)
categorical_features.remove('Train/Test')
categorical_features = categorical_features + ["MS SubClass", 'Yr Sold', 'Mo Sold']
# Ordinal predictors - we will place boolean predictors here too since the operations are the same
ordinal_boolean_features = df_original_combined_no_missing.select_dtypes(include = ["float64", "int64"])
ordinal_boolean_features = [col for col in ordinal_boolean_features.columns if len(df_original_combined_no_missing[col].unique()) <15]
ordinal_boolean_features = list(set(ordinal_boolean_features) - set(categorical_features))
ordinal_remove_boolean = ['Bsmt Full Bath', 'Half Bath', 'Garage Cars', 'Fireplaces', 'TotRms AbvGrd',
'Full Bath', 'Total Bath', 'Kitchen AbvGr','Bsmt Half Bath', 'Pool Area', 'Bedroom AbvGr']
ordinal_boolean_features = [feature for feature in ordinal_boolean_features if feature not in ordinal_remove_boolean]
ordinal_boolean_features
# Numeric predictors
# All features not in other lists
numeric_features = [feature for feature in df_original_combined_no_missing.columns
if feature not in not_useful_features and feature not in categorical_features and feature not in ordinal_boolean_features]
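The dtype-and-cardinality split used above can be sketched on a toy frame (hypothetical values; the notebook uses a cardinality threshold of 15 on the full dataset, lowered here to fit the toy data):

```python
import pandas as pd

# Toy frame: one object column, one low-cardinality int column, one continuous column
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "Sawyer", "NAmes", "Timber", "Sawyer"],
    "Overall Qual": [5, 7, 5, 7, 5],
    "Lot Area":     [7922, 13517, 11492, 9802, 14235],
})

categorical = list(df.select_dtypes(include="object").columns)
numeric_like = df.select_dtypes(include=["int64", "float64"]).columns
threshold = 4  # 15 in the notebook
# Low-cardinality numeric columns are treated as ordinal, the rest as numeric
ordinal = [c for c in numeric_like if df[c].nunique() < threshold]
numeric = [c for c in numeric_like if c not in ordinal]
print(categorical, ordinal, numeric)  # ['Neighborhood'] ['Overall Qual'] ['Lot Area']
```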
df_original_train_no_missing[["BsmtFin SF 1", "BsmtFin SF 2", "Bsmt Unf SF", "Total Bsmt SF", "SalePrice"]].corr()
| BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | SalePrice | |
|---|---|---|---|---|---|
| BsmtFin SF 1 | 1.000000 | -0.050516 | -0.471969 | 0.540053 | 0.424732 |
| BsmtFin SF 2 | -0.050516 | 1.000000 | -0.231023 | 0.086778 | 0.016527 |
| Bsmt Unf SF | -0.471969 | -0.231023 | 1.000000 | 0.419629 | 0.190388 |
| Total Bsmt SF | 0.540053 | 0.086778 | 0.419629 | 1.000000 | 0.629516 |
| SalePrice | 0.424732 | 0.016527 | 0.190388 | 0.629516 | 1.000000 |
We now take a look at the relationships between the numeric features, checking for possible multicollinearity and for features with weak correlations to SalePrice.
# The threshold to qualify as low correlation
corr_response_threshold = 0.1
numeric_df_train = df_original_combined_no_missing[df_original_combined_no_missing["Train/Test"] == "Train"][numeric_features]
numeric_df_train.head()
| PID | Lot Frontage | Lot Area | Year Built | Year Remod/Add | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | TotRms AbvGrd | Fireplaces | Garage Yr Blt | Garage Cars | Garage Area | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | SalePrice | Age Since Built | Age Since Remod/Add | Total Bath | Gross Flr Area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 533352170 | 97.621342 | 13517 | 1976 | 2005 | 289.0 | 533.0 | 0.0 | 192.0 | 725.0 | 725 | 754 | 0 | 1479 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 6 | 0 | 1976.0 | 2.0 | 475.0 | 0 | 44 | 0 | 0 | 0 | 0 | 0 | 130500.0 | 34 | 5 | 3.0 | 2204.0 |
| 1 | 531379050 | 43.000000 | 11492 | 1996 | 1997 | 132.0 | 637.0 | 0.0 | 276.0 | 913.0 | 913 | 1209 | 0 | 2122 | 1.0 | 0.0 | 2 | 1 | 4 | 1 | 8 | 1 | 1997.0 | 2.0 | 559.0 | 0 | 74 | 0 | 0 | 0 | 0 | 0 | 220000.0 | 13 | 12 | 4.0 | 3035.0 |
| 2 | 535304180 | 68.000000 | 7922 | 1953 | 2007 | 0.0 | 731.0 | 0.0 | 326.0 | 1057.0 | 1057 | 0 | 0 | 1057 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | 5 | 0 | 1953.0 | 1.0 | 246.0 | 0 | 52 | 0 | 0 | 0 | 0 | 0 | 109000.0 | 57 | 3 | 2.0 | 2114.0 |
| 3 | 916386060 | 73.000000 | 9802 | 2006 | 2007 | 0.0 | 0.0 | 0.0 | 384.0 | 384.0 | 744 | 700 | 0 | 1444 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 7 | 0 | 2007.0 | 2.0 | 400.0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 174000.0 | 4 | 3 | 3.0 | 1828.0 |
| 4 | 906425045 | 82.000000 | 14235 | 1900 | 1993 | 0.0 | 0.0 | 0.0 | 676.0 | 676.0 | 831 | 614 | 0 | 1445 | 0.0 | 0.0 | 2 | 0 | 3 | 1 | 6 | 0 | 1957.0 | 2.0 | 484.0 | 0 | 59 | 0 | 0 | 0 | 0 | 0 | 138500.0 | 110 | 17 | 2.0 | 2121.0 |
numeric_train_corr = numeric_df_train.corr()
# See which ones fall between -0.1 and 0.1, which would indicate weak correlation with SalePrice
weak_corr = numeric_train_corr[(numeric_train_corr>=-corr_response_threshold) & (numeric_train_corr<= corr_response_threshold)]["SalePrice"].dropna(how = 'all').index
# Plot scatterplot for weakly correlated features
plot_scatterplot(numeric_df_train, weak_corr, "SalePrice")
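The `plot_scatterplot` helper is presumably defined earlier in the notebook; as a rough sketch of what such a helper might look like (plain matplotlib here, so the exact styling will differ from the original):

```python
import math
import matplotlib.pyplot as plt

def plot_scatterplot(df, features, response, num_cols=5):
    """Scatter plot of each feature in `features` against `response`, laid out in a grid."""
    num_plots = len(features)
    num_rows = math.ceil(num_plots / num_cols)
    # squeeze=False keeps axs 2-D even when there is a single row
    fig, axs = plt.subplots(num_rows, num_cols,
                            figsize=(20, num_rows * (20 / num_cols - 0.5)),
                            squeeze=False)
    for i, feature in enumerate(features):
        ax = axs[i // num_cols, i % num_cols]
        ax.scatter(df[feature], df[response], alpha=0.5)
        ax.set_xlabel(feature)
        ax.set_ylabel(response)
    plt.tight_layout()
```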
It seems that some features are mostly zeros. We will check what percentage of each feature's values is taken up by a single value.
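The `single_value_percentage` helper used below is presumably defined earlier in the notebook; a plausible sketch of its behaviour (the exact implementation may differ) returns, for each column, the share taken by its most frequent value:

```python
import pandas as pd

def single_value_percentage(df):
    """For each column, the fraction of rows occupied by the column's most frequent value."""
    percentages = {col: df[col].value_counts(normalize=True).iloc[0] for col in df.columns}
    return (pd.DataFrame.from_dict(percentages, orient="index", columns=["Percentage"])
              .sort_values("Percentage", ascending=False))
```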
# View training data value percentage
df_original_train_no_missing = df_original_combined_no_missing[df_original_combined_no_missing["Train/Test"] == "Train"]
plt.figure(figsize = (20,5))
percentage_bar = sns.barplot(x = single_value_percentage(df_original_train_no_missing).index,
y = single_value_percentage(df_original_train_no_missing)["Percentage"], palette="viridis")
plt.xticks(rotation = 90);
#plt.bar_label(percentage_bar);
We will pick out features with a weak correlation (-0.1 < r < 0.1) and a high percentage of a single value (>90%).
dominant_feature_threshold = 0.9
# List of features where a single value makes up a high proportion of observations (>0.9)
high_single_value_list = single_value_percentage(df_original_train_no_missing)[single_value_percentage(df_original_train_no_missing)["Percentage"] > dominant_feature_threshold].index
# pick out features with a weak correlation (-0.1 < r < 0.1) and a high percentage of a single value (>90%).
weak_corr_single_val = list((high_single_value_list) & (weak_corr))
weak_corr_single_val
['Pool Area', '3Ssn Porch', 'Low Qual Fin SF', 'Misc Val', 'Bsmt Half Bath']
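The `numeric_features_to_drop_by_threshold` function used below is presumably defined earlier in the notebook; a plausible sketch, assuming it simply combines the weak-correlation and dominant-value criteria described above:

```python
import pandas as pd

def numeric_features_to_drop_by_threshold(numeric_df, corr_threshold=0.1,
                                          dominant_feature_threshold=0.9):
    """Features that are both weakly correlated with SalePrice and dominated by one value."""
    corr_with_price = numeric_df.corr()["SalePrice"]
    weak_corr = set(corr_with_price[corr_with_price.abs() <= corr_threshold].index)
    dominant = {col for col in numeric_df.columns
                if numeric_df[col].value_counts(normalize=True).iloc[0] > dominant_feature_threshold}
    return sorted(weak_corr & dominant)
```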
# Generate different feature sets to drop depending on threshold values
# Feature set 1 - corr_threshold = 0.1, dominant_feature_threshold = 0.9
features_to_drop1 = numeric_features_to_drop_by_threshold(numeric_df = numeric_df_train, corr_threshold = 0.1, dominant_feature_threshold = 0.9)
features_to_drop1 = list(set(features_to_drop + features_to_drop1))
print("Num of features: ", len(features_to_drop1))
# Feature set 2 - corr_threshold = 0.2, dominant_feature_threshold = 0.9
features_to_drop2 = numeric_features_to_drop_by_threshold(numeric_df = numeric_df_train, corr_threshold = 0.2, dominant_feature_threshold = 0.9)
features_to_drop2 = list(set(features_to_drop + features_to_drop2))
print("Num of features: ", len(features_to_drop2))
# Feature set 3 - corr_threshold = 0.2, dominant_feature_threshold = 0.8
features_to_drop3 = numeric_features_to_drop_by_threshold(numeric_df = numeric_df_train, corr_threshold = 0.2, dominant_feature_threshold = 0.8)
features_to_drop3 = list(set(features_to_drop + features_to_drop3))
print("Num of features: ", len(features_to_drop3))
Num of features:  7
Num of features:  9
Num of features:  11
We have generated 3 feature sets (of features to drop) by setting different threshold values.
We will use a correlation value of >0.8 or <-0.8 as threshold for high multicollinearity.
# Change this value to set threshold
multicol_threshold = 0.8
# Get list of features with corr values that fall within our condition
df_train_corr = numeric_df_train.corr()
df_train_corr_nulls = df_train_corr[(df_train_corr > multicol_threshold) | (df_train_corr < -multicol_threshold)] # df with correlation values by condition
multicol_features = list((df_train_corr_nulls.notna().sum()[(df_train_corr_nulls.notna().sum() >1)]).index)
multicol_features.append("SalePrice")
plt.figure(figsize = (15, 15))
sns.heatmap(numeric_df_train[multicol_features].corr(), cmap="RdBu_r", annot = True, fmt = ".2f",
vmin = -multicol_threshold, vmax = multicol_threshold)
# Get a dictionary of features with their respective correlated features
multicollinear_dict = {}
df_train_corr_multicol = numeric_df_train[multicol_features].drop("SalePrice", axis = 1).corr()
# drop the column itself
for col in df_train_corr_multicol.columns:
correlated_features = list(df_train_corr_multicol[col][((df_train_corr_multicol[col]>multicol_threshold) | (df_train_corr_multicol[col]<-multicol_threshold))].index)
correlated_features.remove(col)
multicollinear_dict[col] = correlated_features
# To get a set of unique feature pairs in order to examine further
feature_pair_remove_dup = []
for feature, corr_feature_list in multicollinear_dict.items():
for corr_feature in corr_feature_list:
feature_pair = frozenset([feature, corr_feature])
feature_pair_remove_dup.append(feature_pair)
#set(feature_pair_remove_dup)
feature_pair_remove_dup = set(feature_pair_remove_dup) # to get unique pairs
feature_pair_remove_dup = [list(frozen_set) for frozen_set in feature_pair_remove_dup]
plot_scatterplot_pairs(numeric_df_train, feature_pair_remove_dup)
Looking at the scatter plots, we will add additional features to the list of features to be dropped.
features_to_drop1 = list(set(features_to_drop1 + ["Garage Cars", "TotRms AbvGrd"]))
features_to_drop2 = list(set(features_to_drop2 + ["Garage Cars", "TotRms AbvGrd"]))
features_to_drop3 = list(set(features_to_drop3 + ["Garage Cars", "TotRms AbvGrd"]))
print(len(features_to_drop1))
print(len(features_to_drop2))
print(len(features_to_drop3))
9
11
13
Some groups of features, such as basement areas, living areas and number of bathrooms, include a sum-total feature derived from the other features in the group. We fit each group through a simple OLS model to see whether the individual features are significant.
numeric_df_train.head()
| PID | Lot Frontage | Lot Area | Year Built | Year Remod/Add | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | TotRms AbvGrd | Fireplaces | Garage Yr Blt | Garage Cars | Garage Area | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | SalePrice | Age Since Built | Age Since Remod/Add | Total Bath | Gross Flr Area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 533352170 | 97.621342 | 13517 | 1976 | 2005 | 289.0 | 533.0 | 0.0 | 192.0 | 725.0 | 725 | 754 | 0 | 1479 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 6 | 0 | 1976.0 | 2.0 | 475.0 | 0 | 44 | 0 | 0 | 0 | 0 | 0 | 130500.0 | 34 | 5 | 3.0 | 2204.0 |
| 1 | 531379050 | 43.000000 | 11492 | 1996 | 1997 | 132.0 | 637.0 | 0.0 | 276.0 | 913.0 | 913 | 1209 | 0 | 2122 | 1.0 | 0.0 | 2 | 1 | 4 | 1 | 8 | 1 | 1997.0 | 2.0 | 559.0 | 0 | 74 | 0 | 0 | 0 | 0 | 0 | 220000.0 | 13 | 12 | 4.0 | 3035.0 |
| 2 | 535304180 | 68.000000 | 7922 | 1953 | 2007 | 0.0 | 731.0 | 0.0 | 326.0 | 1057.0 | 1057 | 0 | 0 | 1057 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | 5 | 0 | 1953.0 | 1.0 | 246.0 | 0 | 52 | 0 | 0 | 0 | 0 | 0 | 109000.0 | 57 | 3 | 2.0 | 2114.0 |
| 3 | 916386060 | 73.000000 | 9802 | 2006 | 2007 | 0.0 | 0.0 | 0.0 | 384.0 | 384.0 | 744 | 700 | 0 | 1444 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 7 | 0 | 2007.0 | 2.0 | 400.0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 174000.0 | 4 | 3 | 3.0 | 1828.0 |
| 4 | 906425045 | 82.000000 | 14235 | 1900 | 1993 | 0.0 | 0.0 | 0.0 | 676.0 | 676.0 | 831 | 614 | 0 | 1445 | 0.0 | 0.0 | 2 | 0 | 3 | 1 | 6 | 0 | 1957.0 | 2.0 | 484.0 | 0 | 59 | 0 | 0 | 0 | 0 | 0 | 138500.0 | 110 | 17 | 2.0 | 2121.0 |
basement_areas_features = ["BsmtFin SF 1", "BsmtFin SF 2", "Bsmt Unf SF", "Total Bsmt SF"]
gr_areas_features = ["1st Flr SF", "2nd Flr SF", "Low Qual Fin SF", "Gr Liv Area"]
num_baths_features = ["Bsmt Full Bath", "Bsmt Half Bath", "Full Bath", "Half Bath", "Total Bath"]
all_area_features = basement_areas_features + gr_areas_features + ["Gross Flr Area"]
all_area_features
['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Gross Flr Area']
# Run the 4 sets of features through OLS to determine p-values
ols_test_features = [basement_areas_features, gr_areas_features, num_baths_features, all_area_features]
for features in ols_test_features:
ss = StandardScaler()
X_scaled = ss.fit_transform(numeric_df_train[features])
X_scaled_sm = sm.add_constant(X_scaled)
y = numeric_df_train["SalePrice"]
ols = sm.OLS(y, X_scaled_sm).fit()
print(ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.407
Model: OLS Adj. R-squared: 0.406
Method: Least Squares F-statistic: 467.7
Date: Wed, 09 Feb 2022 Prob (F-statistic): 2.68e-231
Time: 01:00:10 Log-Likelihood: -25462.
No. Observations: 2047 AIC: 5.093e+04
Df Residuals: 2043 BIC: 5.095e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.814e+05 1350.963 134.308 0.000 1.79e+05 1.84e+05
x1 2.103e+04 910.412 23.100 0.000 1.92e+04 2.28e+04
x2 2093.4971 1302.134 1.608 0.108 -460.152 4647.146
x3 1.141e+04 936.059 12.189 0.000 9574.095 1.32e+04
x4 3.36e+04 929.027 36.164 0.000 3.18e+04 3.54e+04
==============================================================================
Omnibus: 404.047 Durbin-Watson: 1.986
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8072.414
Skew: 0.363 Prob(JB): 0.00
Kurtosis: 12.702 Cond. No. 4.73e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.42e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.570
Model: OLS Adj. R-squared: 0.569
Method: Least Squares F-statistic: 902.0
Date: Wed, 09 Feb 2022 Prob (F-statistic): 0.00
Time: 01:00:10 Log-Likelihood: -25134.
No. Observations: 2047 AIC: 5.028e+04
Df Residuals: 2043 BIC: 5.030e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.814e+05 1150.810 157.667 0.000 1.79e+05 1.84e+05
x1 3.298e+04 852.131 38.707 0.000 3.13e+04 3.47e+04
x2 8032.1223 809.316 9.925 0.000 6444.952 9619.292
x3 -6503.4129 1148.141 -5.664 0.000 -8755.061 -4251.765
x4 3.228e+04 663.477 48.646 0.000 3.1e+04 3.36e+04
==============================================================================
Omnibus: 664.574 Durbin-Watson: 2.054
Prob(Omnibus): 0.000 Jarque-Bera (JB): 23761.903
Skew: -0.848 Prob(JB): 0.00
Kurtosis: 19.605 Cond. No. 5.47e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.19e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.419
Model: OLS Adj. R-squared: 0.418
Method: Least Squares F-statistic: 368.2
Date: Wed, 09 Feb 2022 Prob (F-statistic): 6.89e-239
Time: 01:00:10 Log-Likelihood: -25441.
No. Observations: 2047 AIC: 5.089e+04
Df Residuals: 2042 BIC: 5.092e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.814e+05 1337.690 135.640 0.000 1.79e+05 1.84e+05
x1 1.087e+04 1159.589 9.373 0.000 8594.635 1.31e+04
x2 -3090.9989 1293.728 -2.389 0.017 -5628.162 -553.835
x3 2.692e+04 1198.137 22.464 0.000 2.46e+04 2.93e+04
x4 4572.5091 1227.791 3.724 0.000 2164.655 6980.363
x5 2.331e+04 679.707 34.297 0.000 2.2e+04 2.46e+04
==============================================================================
Omnibus: 561.491 Durbin-Watson: 1.980
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2117.665
Skew: 1.305 Prob(JB): 0.00
Kurtosis: 7.245 Cond. No. 2.55e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 6.46e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.630
Model: OLS Adj. R-squared: 0.629
Method: Least Squares F-statistic: 578.5
Date: Wed, 09 Feb 2022 Prob (F-statistic): 0.00
Time: 01:00:10 Log-Likelihood: -24980.
No. Observations: 2047 AIC: 4.997e+04
Df Residuals: 2040 BIC: 5.001e+04
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.814e+05 1068.284 169.847 0.000 1.79e+05 1.84e+05
x1 1.17e+04 853.310 13.714 0.000 1e+04 1.34e+04
x2 -5.2695 1046.317 -0.005 0.996 -2057.230 2046.691
x3 1109.8700 823.380 1.348 0.178 -504.884 2724.624
x4 1.309e+04 1146.422 11.417 0.000 1.08e+04 1.53e+04
x5 1.114e+04 1372.665 8.115 0.000 8447.527 1.38e+04
x6 1.118e+04 811.448 13.780 0.000 9590.472 1.28e+04
x7 -5067.2519 1071.363 -4.730 0.000 -7168.332 -2966.172
x8 1.78e+04 785.633 22.661 0.000 1.63e+04 1.93e+04
x9 1.825e+04 380.545 47.954 0.000 1.75e+04 1.9e+04
==============================================================================
Omnibus: 1433.036 Durbin-Watson: 2.030
Prob(Omnibus): 0.000 Jarque-Bera (JB): 178975.528
Skew: -2.413 Prob(JB): 0.00
Kurtosis: 48.553 Cond. No. 1.46e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.43e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
From the p-values, we see that most of the features we tested are significant. The exceptions are BsmtFin SF 2 (p = 0.108 in the basement-area model, 0.996 in the combined model) and Bsmt Unf SF (p = 0.178 in the combined model), so we will drop these two features.
features_to_drop1 = list(set(features_to_drop1 + ["Bsmt Unf SF", "BsmtFin SF 2"]))
features_to_drop2 = list(set(features_to_drop2 + ["Bsmt Unf SF", "BsmtFin SF 2"]))
features_to_drop3 = list(set(features_to_drop3 + ["Bsmt Unf SF", "BsmtFin SF 2"]))
print(len(features_to_drop1))
print(len(features_to_drop2))
print(len(features_to_drop3))
11 13 14
Next we will examine the ordinal features by looking at some boxplots of the features against SalePrice.
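The `plot_boxplot` helper used below is presumably defined earlier in the notebook; a rough sketch of what such a helper might look like (plain matplotlib boxplots here, so the original's styling will differ):

```python
import math
import matplotlib.pyplot as plt

def plot_boxplot(df, features, response, height_multiple=0.5, num_cols=4):
    """Boxplots of `response` grouped by each ordinal/categorical feature."""
    num_rows = math.ceil(len(features) / num_cols)
    # squeeze=False keeps axs 2-D even when there is a single row
    fig, axs = plt.subplots(num_rows, num_cols,
                            figsize=(20, num_rows * 20 * height_multiple / num_cols),
                            squeeze=False)
    for i, feature in enumerate(features):
        ax = axs[i // num_cols, i % num_cols]
        grouped = df.groupby(feature)[response]
        labels, data = zip(*[(level, values.dropna()) for level, values in grouped])
        ax.boxplot(data)
        ax.set_xticks(range(1, len(labels) + 1))
        ax.set_xticklabels(labels)
        ax.set_title(feature)
    plt.tight_layout()
```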
ordinal_df_train = df_original_combined_no_missing[df_original_combined_no_missing["Train/Test"] == "Train"][ordinal_boolean_features+["SalePrice"]]
ordinal_df_train.head()
| Overall Qual | Has Garage | Has Pool | Has Screen Porch | Garage Cond | Bsmt Cond | Has 3Ssn Porch | Heating QC | Fireplace Qu | Exter Qual | Exter Cond | Has Open Porch | Kitchen Qual | Has Basement | Garage Qual | Has Mas Vnr | Overall Cond | Has 2nd Flr | Pool QC | Central Air | Bsmt Qual | Functional | Has Wood Deck | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 1 | 0 | 0 | 3 | 3 | 0 | 5 | 0 | 4 | 3 | 1 | 4 | 1 | 3 | 1 | 8 | 1 | 0 | 1 | 3 | 8 | 0 | 130500.0 |
| 1 | 7 | 1 | 0 | 0 | 3 | 3 | 0 | 5 | 3 | 4 | 3 | 1 | 4 | 1 | 3 | 1 | 5 | 1 | 0 | 1 | 4 | 8 | 0 | 220000.0 |
| 2 | 5 | 1 | 0 | 0 | 3 | 3 | 0 | 3 | 0 | 3 | 4 | 1 | 4 | 1 | 3 | 0 | 7 | 0 | 0 | 1 | 3 | 8 | 0 | 109000.0 |
| 3 | 5 | 1 | 0 | 0 | 3 | 3 | 0 | 4 | 0 | 3 | 3 | 0 | 3 | 1 | 3 | 0 | 5 | 1 | 0 | 1 | 4 | 8 | 1 | 174000.0 |
| 4 | 6 | 1 | 0 | 0 | 3 | 4 | 0 | 3 | 0 | 3 | 3 | 1 | 3 | 1 | 3 | 0 | 8 | 1 | 0 | 1 | 2 | 8 | 0 | 138500.0 |
plot_boxplot(df = ordinal_df_train, features = ordinal_boolean_features, response = "SalePrice", height_multiple = 0.5)
We can see from the boxplots that Pool QC is unlikely to have a significant impact on SalePrice.
For the other features, we do see some differences but cannot judge whether they are significant, so we will create additional feature sets with different features to drop.
# Add Pool QC to our original 3 sets of feature names
features_to_drop1 = list(set(features_to_drop1 + ['Pool QC']))
features_to_drop2 = list(set(features_to_drop2 + ['Pool QC']))
features_to_drop3 = list(set(features_to_drop3 + ['Pool QC']))
# Create additional feature sets by adding more columns to drop
features_to_drop4 = list(set(features_to_drop1 + ['Has 2nd Flr', 'Has Screen Porch', 'Has 3Ssn Porch', 'Has Pool']))
features_to_drop5 = list(set(features_to_drop2 + ['Has 2nd Flr', 'Has Screen Porch', 'Has 3Ssn Porch', 'Has Pool']))
features_to_drop6 = list(set(features_to_drop3 + ['Has 2nd Flr', 'Has Screen Porch', 'Has 3Ssn Porch', 'Has Pool']))
print(len(features_to_drop1))
print(len(features_to_drop2))
print(len(features_to_drop3))
print(len(features_to_drop4))
print(len(features_to_drop5))
print(len(features_to_drop6))
12
14
15
16
18
19
We will perform the same type of analysis on the categorical features.
categorical_df_train = df_original_combined_no_missing[df_original_combined_no_missing["Train/Test"] == "Train"][categorical_features+["SalePrice"]]
categorical_df_train.head()
| MS Zoning | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Foundation | Bsmt Exposure | BsmtFin Type 1 | BsmtFin Type 2 | Heating | Electrical | Garage Type | Garage Finish | Paved Drive | Fence | Misc Feature | Sale Type | MS SubClass | Yr Sold | Mo Sold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | Sawyer | RRAe | Norm | 1Fam | 2Story | Gable | CompShg | HdBoard | Plywood | BrkFace | CBlock | No | GLQ | Unf | GasA | SBrkr | Attchd | RFn | Y | NA | NA | WD | 60 | 2010 | 3 | 130500.0 |
| 1 | RL | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | SawyerW | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | BrkFace | PConc | No | GLQ | Unf | GasA | SBrkr | Attchd | RFn | Y | NA | NA | WD | 60 | 2009 | 4 | 220000.0 |
| 2 | RL | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | Gable | CompShg | VinylSd | VinylSd | None | CBlock | No | GLQ | Unf | GasA | SBrkr | Detchd | Unf | Y | NA | NA | WD | 20 | 2010 | 1 | 109000.0 |
| 3 | RL | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Timber | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | None | PConc | No | Unf | Unf | GasA | SBrkr | BuiltIn | Fin | Y | NA | NA | WD | 60 | 2010 | 4 | 174000.0 |
| 4 | RL | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | SawyerW | Norm | Norm | 1Fam | 1.5Fin | Gable | CompShg | Wd Sdng | Plywood | None | PConc | No | Unf | Unf | GasA | SBrkr | Detchd | Unf | N | NA | NA | WD | 50 | 2010 | 3 | 138500.0 |
plot_boxplot(df = categorical_df_train, features = categorical_features, response = "SalePrice", height_multiple = 0.33)
Again, we will add some features to our existing drop lists and create additional feature sets; in total we create 3 extra groups of feature sets.
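The `add_list_to_lists` helper used below is presumably defined earlier in the notebook; a plausible sketch, assuming it simply appends a common list of features to each drop list and de-duplicates:

```python
def add_list_to_lists(lists, features_to_add):
    """Return new lists, each the de-duplicated union of an input list and `features_to_add`."""
    return [list(set(lst) | set(features_to_add)) for lst in lists]
```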
# categorical features we are quite sure have little impact - add to existing drop lists
features_to_drop1, features_to_drop2, features_to_drop3, features_to_drop4, features_to_drop5, features_to_drop6 = add_list_to_lists([features_to_drop1, features_to_drop2, features_to_drop3, features_to_drop4, features_to_drop5, features_to_drop6],
["Mo Sold", "Yr Sold", "Fence"])
# categorical features that might have little impact, but we are not sure - create new feature sets
features_to_drop7, features_to_drop8, features_to_drop9, features_to_drop10, features_to_drop11, features_to_drop12 = add_list_to_lists([features_to_drop1, features_to_drop2, features_to_drop3, features_to_drop4, features_to_drop5, features_to_drop6],
["Alley", "Lot Shape", "Fence", "Lot Config", "Bldg Type", "Roof Style"])
# categorical features that have too many unique values, which may result in overfitting
# we create new features on top of the previous set of features to drop
features_to_drop13, features_to_drop14, features_to_drop15, features_to_drop16, features_to_drop17, features_to_drop18 = add_list_to_lists([features_to_drop7, features_to_drop8, features_to_drop9, features_to_drop10, features_to_drop11, features_to_drop12],
["MS SubClass", "Exterior 1st", "Exterior 2nd"])
# print out the feature lengths
"""print(len(features_to_drop1))
print(len(features_to_drop2))
print(len(features_to_drop3))
print(len(features_to_drop4))
print(len(features_to_drop5))
print(len(features_to_drop6))
print(len(features_to_drop7))
print(len(features_to_drop8))
print(len(features_to_drop9))
print(len(features_to_drop10))
print(len(features_to_drop11))
print(len(features_to_drop12))
print(len(features_to_drop13))
print(len(features_to_drop14))
print(len(features_to_drop15))
print(len(features_to_drop16))
print(len(features_to_drop17))
print(len(features_to_drop18))"""
print()
Just an extra look at the neighborhoods.
sns.catplot(data = categorical_df_train, x = "Neighborhood", y = "SalePrice", palette='colorblind', height = 5, aspect = 4)
plt.xticks(rotation = 90);
We have examined and identified numeric variables with a weak correlation and a high percentage of dominant values. We will now identify ordinal and categorical variables dominated by a single value and create 2 more feature groups: those with more than 80% of observations at a single value, and those with more than 90%.
# ordinal and categorical features
categorical_ordinal_features = categorical_features + ordinal_boolean_features
categorical_ordinal_df = df_original_combined_no_missing[df_original_combined_no_missing["Train/Test"] == "Train"][categorical_ordinal_features]
# And get the feature names of those with more than 90% or 80% dominant value
single_val_cat_ord = single_value_percentage(categorical_ordinal_df)
dominant_80 = list(single_val_cat_ord[single_val_cat_ord>0.8].dropna().index)
dominant_90 = list(single_val_cat_ord[single_val_cat_ord>0.9].dropna().index)
# And create more feature sets to drop
# Because there are many more feature sets now, we will create a dictionary to hold all feature sets
# The first feature set (0) will be an empty set - complete set of features - our baseline model
feature_sets_to_drop = [[], features_to_drop1, features_to_drop2, features_to_drop3, features_to_drop4,
features_to_drop5, features_to_drop6, features_to_drop7, features_to_drop8,
features_to_drop9, features_to_drop10, features_to_drop11, features_to_drop12,
features_to_drop13, features_to_drop14, features_to_drop15, features_to_drop16,
features_to_drop17, features_to_drop18]
# feature sets dictionary of current feature sets
feature_sets_to_drop_dict = {f"feature_set_{i}": feature_set for i, feature_set in enumerate(feature_sets_to_drop)}
# add the dominant_80 and dominant_90 features to each of the existing features to drop list
# and create 36 new feature sets, making a total of 55 sets
# for dominant_80
current_dict_length = len(feature_sets_to_drop_dict) # current length of dict
current_dict_items = list(feature_sets_to_drop_dict.values())
for i, features in enumerate(current_dict_items):
if features != []:
feature_sets_to_drop_dict[f"feature_set_{i+current_dict_length-1}"] = list(set(features + dominant_80))
# for dominant_90
current_dict_length = len(feature_sets_to_drop_dict) # current length of dict
for i, features in enumerate(current_dict_items):
if features != []:
feature_sets_to_drop_dict[f"feature_set_{i+current_dict_length-1}"] = list(set(features + dominant_90))
# print out len of feature sets to drop
"""for i, feature_list in enumerate(feature_sets_to_drop_dict.values()):
print(f"Feature set {i} length: ", len(feature_list))"""
print()
First we take a look at the histogram of each feature to see if we can reduce the skewness of some features and make them resemble a normal distribution more closely. Linear regression algorithms generally handle normally distributed data better.
# Plot the histogram of a number of features
def plot_hist(df_numeric):
num_plots = len(df_numeric.columns)
num_cols = 5
num_rows = math.ceil(num_plots/num_cols)
fig, axs = plt.subplots(num_rows, num_cols, figsize = (20, num_rows*(20/num_cols-0.5)), squeeze = False) # squeeze = False keeps axs 2-D even when num_rows == 1
# loop through rows and columns and plot scatter plot
predictor_count = 0
for row in range(num_rows):
for col in range(num_cols):
if predictor_count < num_plots:
sns.histplot(x = df_numeric.iloc[:,predictor_count], ax = axs[row, col], kde = True)
axs[row, col].set_title(f"Skewness: {round(df_numeric.iloc[:,predictor_count].skew(), 2)}")
predictor_count += 1
else:
break
plt.tight_layout()
plot_hist(df_original_combined_no_missing[numeric_features])
We will try to reduce the skewness of some features to bring them closer to a normal distribution. We use a Box-Cox transformation on (x + 1), which can handle zero values. We will write a function that iterates through a range of lambda values and, for each feature, picks the lambda that minimizes skewness.
# We will first write a function that applies a box cox transformation after selecting the best lambda value for box cox
def boxcox_transform(col):
lambdas = np.linspace(0,1,101) # 101 values from 0 to 1
current_skew = np.inf # initialize with infinity so any real skew is smaller
transformed_col = 0
selected_lambda = 0
# iterate through the lambdas and find the lambda that minimizes skewness
for lamb in lambdas:
transformed_col_temp = special.boxcox1p(col, lamb)
skew = transformed_col_temp.skew()
if abs(skew) < abs(current_skew): # find the lowest skew by comparing current skew with the new skew
current_skew = skew
transformed_col = transformed_col_temp
selected_lambda = lamb
if col.name == "SalePrice": # we need the lambda for SalePrice in order to reverse box cox later
sale_price_lambda.append(selected_lambda) # add best lambda
return transformed_col
# pick out the features we want to transform
"""features_to_transform = ["Lot Frontage", "Lot Area", "BsmtFin SF 1", "Bsmt Unf SF", "Total Bsmt SF", "Garage Area",
"Wood Deck SF", "Open Porch SF", "Enclosed Porch", "3Ssn Porch", "1st Flr SF", "Gr Liv Area",
"SalePrice","Gross Flr Area"]"""
"""features_to_transform = ['PID', 'Lot Frontage', 'Lot Area', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area',
'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF',
'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Yr Blt', 'Garage Cars',
'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch',
'Pool Area', 'Misc Val', 'SalePrice', 'Age Since Built', 'Total Bath',
'Gross Flr Area']"""
features_to_transform = ['Lot Frontage', 'Lot Area', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area',
'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF',
'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Yr Blt', 'Garage Cars',
'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', 'Screen Porch',
'Misc Val', 'SalePrice', 'Age Since Built', 'Total Bath',
'Gross Flr Area']
sale_price_lambda = [] # the lambda value used for box cox
transformed_features = df_original_combined_no_missing[features_to_transform].apply(boxcox_transform) # apply the transformation
# drop original selected columns and concat transformed columns to df
df_original_combined_no_missing_temp = df_original_combined_no_missing.drop(features_to_transform, axis = 1)
df_original_combined_no_missing = pd.concat([df_original_combined_no_missing_temp, transformed_features], axis = 1)
print(f"SalePrice lambda: {sale_price_lambda[0]}")
# plot the transformed features
plot_hist(df_original_combined_no_missing[features_to_transform])
SalePrice lambda: 0.06
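Since SalePrice itself was transformed with `boxcox1p`, model predictions will later need to be mapped back to dollars. `scipy.special.inv_boxcox1p` is the exact inverse, given the stored lambda:

```python
import numpy as np
from scipy import special

lamb = 0.06  # the SalePrice lambda found above

prices = np.array([130500.0, 220000.0])
transformed = special.boxcox1p(prices, lamb)
recovered = special.inv_boxcox1p(transformed, lamb)
# recovered matches the original prices up to floating-point error
```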
We have significantly reduced the skewness of the selected features.
We can now start the modelling process with the 55 different feature sets (including the base set with all features). First, we obtain dataframes by dropping the different sets of features, then we create dummy variables for categorical variables and re-split our feature sets into train and test.
# create a dictionary with the feature set name and the associated dataframe of features
feature_sets = {}
feature_sets_dummy = {}
X_trains = {}
y_trains = {}
X_tests = {}
for i, drop_features in enumerate(feature_sets_to_drop_dict.values()):
    # create feature sets by dropping the features to drop
    feature_set_name = f"feature_set_{i}"
    feature_sets[feature_set_name] = df_original_combined_no_missing.drop(drop_features, axis = 1).drop('Id', axis = 1)
    # create dummy variables for the categorical features present in this set
    categorical_var_subset = [col for col in feature_sets[feature_set_name].columns
                              if col in categorical_features]
    feature_sets_dummy[feature_set_name] = pd.get_dummies(feature_sets[feature_set_name],
                                                          columns = categorical_var_subset,
                                                          drop_first = True)
    # re-split into train and test sets
    # Train set
    X_trains[feature_set_name] = feature_sets_dummy[feature_set_name][feature_sets_dummy[feature_set_name]["Train/Test"] == 'Train'].drop(["Train/Test", "SalePrice"], axis = 1)
    y_trains[feature_set_name] = feature_sets_dummy[feature_set_name][feature_sets_dummy[feature_set_name]["Train/Test"] == 'Train']['SalePrice']
    # Test set
    X_tests[feature_set_name] = feature_sets_dummy[feature_set_name][feature_sets_dummy[feature_set_name]["Train/Test"] == 'Test'].drop(["Train/Test", "SalePrice"], axis = 1)
    """print(feature_set_name, "(X_train) shape:", X_trains[feature_set_name].shape)
    print(feature_set_name, "(y_train) shape:", y_trains[feature_set_name].shape)
    print(feature_set_name, "(X_test) shape:", X_tests[feature_set_name].shape)"""
Now let us review the 55 feature sets.
feature_sets_summary = pd.DataFrame([feature_sets_to_drop_dict.keys(), feature_sets_to_drop_dict.values()]).T
feature_sets_summary.rename(columns = {0:"Feature Set Name", 1:"Dropped Features"}, inplace = True)
feature_nums = [X_trains[set_name].shape[1] for set_name in feature_sets_summary["Feature Set Name"]]
feature_sets_summary = pd.concat([feature_sets_summary, pd.Series(feature_nums, name = "Number of Features")], axis = 1)
feature_sets_summary['Number of Features Dropped'] = 271 - feature_sets_summary['Number of Features']  # 271 = size of the full dummy-encoded base set (feature_set_0)
feature_sets_summary.head(17)
| | Feature Set Name | Dropped Features | Number of Features | Number of Features Dropped |
|---|---|---|---|---|
| 0 | feature_set_0 | [] | 271 | 0 |
| 1 | feature_set_1 | [TotRms AbvGrd, Low Qual Fin SF, Pool Area, Ga... | 240 | 31 |
| 2 | feature_set_2 | [TotRms AbvGrd, Low Qual Fin SF, Screen Porch,... | 238 | 33 |
| 3 | feature_set_3 | [TotRms AbvGrd, Low Qual Fin SF, Screen Porch,... | 237 | 34 |
| 4 | feature_set_4 | [Has Pool, Pool Area, Low Qual Fin SF, Garage ... | 236 | 35 |
| 5 | feature_set_5 | [Has Pool, Pool Area, Low Qual Fin SF, Garage ... | 234 | 37 |
| 6 | feature_set_6 | [Has Pool, Pool Area, Low Qual Fin SF, Bsmt Un... | 233 | 38 |
| 7 | feature_set_7 | [Pool Area, Low Qual Fin SF, Garage Cars, Bsmt... | 222 | 49 |
| 8 | feature_set_8 | [Pool Area, Low Qual Fin SF, Garage Cars, Bsmt... | 220 | 51 |
| 9 | feature_set_9 | [Pool Area, Low Qual Fin SF, Garage Cars, Bsmt... | 219 | 52 |
| 10 | feature_set_10 | [Has Pool, Pool Area, Low Qual Fin SF, Bsmt Un... | 218 | 53 |
| 11 | feature_set_11 | [Has Pool, Pool Area, Low Qual Fin SF, Bsmt Un... | 216 | 55 |
| 12 | feature_set_12 | [Has Pool, Pool Area, Low Qual Fin SF, Garage ... | 215 | 56 |
| 13 | feature_set_13 | [Pool Area, Low Qual Fin SF, Garage Cars, Bsmt... | 176 | 95 |
| 14 | feature_set_14 | [Pool Area, Low Qual Fin SF, Garage Cars, Bsmt... | 174 | 97 |
| 15 | feature_set_15 | [Pool Area, Low Qual Fin SF, Garage Cars, Bsmt... | 173 | 98 |
| 16 | feature_set_16 | [Has Pool, Pool Area, Low Qual Fin SF, Garage ... | 172 | 99 |
feature_sets_summary['Dropped Features'][16]
['Has Pool', 'Pool Area', 'Low Qual Fin SF', 'Garage Cars', 'Bsmt Unf SF', 'Has Screen Porch', 'Bsmt Half Bath', 'Yr Sold', 'Year Built', 'Has 3Ssn Porch', 'Misc Val', 'Alley', 'Exterior 2nd', 'Exterior 1st', 'BsmtFin SF 2', '3Ssn Porch', 'MS SubClass', 'Bldg Type', 'Lot Config', 'Has 2nd Flr', 'Mo Sold', 'Pool QC', 'Lot Shape', 'TotRms AbvGrd', 'Roof Style', 'Fence', 'Year Remod/Add']
#pd.Series(feature_sets_summary['Dropped Features'][16]).to_csv("Dropped Features.csv")
# Different estimators and their hyperparameter grids
estimators = {
    "linear": {'estimator': LinearRegression(), 'params': {}},
    "ridge": {'estimator': Ridge(), 'params': {"ridge__alpha": np.logspace(-1, 5, 100)}},
    "lasso": {'estimator': Lasso(), 'params': {"lasso__alpha": np.logspace(-4, 1, 10)}},
    "enet": {'estimator': ElasticNet(), "params": {"enet__alpha": np.linspace(0.1, 1.0, 10), "enet__l1_ratio": np.linspace(0.1, 1, 10)}},
    "knn": {"estimator": KNeighborsRegressor(), "params": {"knn__n_neighbors": [3, 5, 7, 9, 11, 21, 31, 41, 51, 101], "knn__weights": ['uniform', 'distance']}}
}
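The double-underscore keys (e.g. `"ridge__alpha"`) are not arbitrary: `GridSearchCV` routes each parameter to a pipeline step via the `<step name>__<param name>` convention, which is why every grid key is prefixed with the step name used in the `Pipeline`. A minimal self-contained sketch (toy data, not the Ames features):

```python
# Toy demonstration of the "<step name>__<param name>" routing convention.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)
y = X @ np.array([1.0, 2.0]) + 3.0  # exact linear relationship

pipe = Pipeline([("ss", StandardScaler()), ("ridge", Ridge())])
# "ridge__alpha" targets the alpha parameter of the step named "ridge"
search = GridSearchCV(pipe, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```

If the step were registered under a different name, the key prefix would have to change to match, which is why the loop below names each pipeline step after the estimator.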
# conduct the grid search and save the results in a list
grid_search_result = []
# loop through all feature sets, then loop through the selected estimators and run the pipeline
for i, feature_set_name in enumerate(X_trains):
    num_features = X_trains[feature_set_name].shape[1]
    print(f"------------------------ Feature set {i} - no. of features: {num_features} ------------------------")
    grid_search_result_row = {}
    for estimator_name, estimator_info in estimators.items():
        # create pipeline: standardize features, then fit the estimator
        pipe = Pipeline([
            ("ss", StandardScaler()),
            (estimator_name, estimator_info['estimator'])
        ])
        # run grid search with 5-fold CV, scored on RMSE
        pipe_params = estimator_info['params']
        pipe_gridsearch = GridSearchCV(pipe,
                                       pipe_params,
                                       cv = 5,
                                       verbose = 0,
                                       scoring = "neg_root_mean_squared_error",
                                       n_jobs = 7)
        pipe_gridsearch.fit(X_trains[feature_set_name], y_trains[feature_set_name])
        print(f"{estimator_name} score:", -pipe_gridsearch.best_score_)
        # record the estimator and its results in the row for this feature set
        grid_search_result_row["feature_set"] = feature_set_name
        grid_search_result_row["num of features"] = num_features
        grid_search_result_row[f"{estimator_name}: grid search object"] = pipe_gridsearch
        grid_search_result_row[f"{estimator_name}: best estimator"] = pipe_gridsearch.best_estimator_
        grid_search_result_row[f"{estimator_name}: best score"] = -pipe_gridsearch.best_score_
        grid_search_result_row[f"{estimator_name}: best params"] = pipe_gridsearch.best_params_
    grid_search_result.append(grid_search_result_row)
------------------------ Feature set 0 - no. of features: 271 ------------------------ linear score: 498488743415.54065 ridge score: 0.26725876886783595 lasso score: 0.26309780391027265 enet score: 0.26258635092382276 knn score: 0.39870842935812634
------------------------ Feature set 1 - no. of features: 240 ------------------------ linear score: 692656179964.986 ridge score: 0.26722647152242684 lasso score: 0.26308873516842757 enet score: 0.26363932347471253 knn score: 0.3972344053772397
------------------------ Feature set 2 - no. of features: 238 ------------------------ linear score: 2093743461355.0813 ridge score: 0.26750890568313535 lasso score: 0.26353987677394775 enet score: 0.26381520530712976 knn score: 0.39769114847281933
------------------------ Feature set 3 - no. of features: 237 ------------------------ linear score: 2220203398457.7915 ridge score: 0.2672863634885267 lasso score: 0.263286513538074 enet score: 0.2637934005400305 knn score: 0.39803222970637087
------------------------ Feature set 4 - no. of features: 236 ------------------------ linear score: 722625370982.4689 ridge score: 0.26544653903816434 lasso score: 0.2619569106473491 enet score: 0.26335641081694 knn score: 0.3978800333997581
------------------------ Feature set 5 - no. of features: 234 ------------------------ linear score: 733271047814.1914 ridge score: 0.2681107225021523 lasso score: 0.26441393436781 enet score: 0.2651392955609522 knn score: 0.3960867847467481
------------------------ Feature set 6 - no. of features: 233 ------------------------ linear score: 3600750797304.053 ridge score: 0.26788275406194045 lasso score: 0.2642929235293023 enet score: 0.26513929587963303 knn score: 0.39639499412261203
------------------------ Feature set 7 - no. of features: 222 ------------------------ linear score: 2806436275094.521 ridge score: 0.2664063867547932 lasso score: 0.26208127650935265 enet score: 0.26317025890213647 knn score: 0.3958438037181441
------------------------ Feature set 8 - no. of features: 220 ------------------------ linear score: 995172161833.8326 ridge score: 0.2666289748633257 lasso score: 0.2624940770642178 enet score: 0.2633799384353675 knn score: 0.3962666521528629
------------------------ Feature set 9 - no. of features: 219 ------------------------ linear score: 223910030411.65674 ridge score: 0.2664822123028654 lasso score: 0.26224542573926557 enet score: 0.26335197928240134 knn score: 0.39637482286220405
------------------------ Feature set 10 - no. of features: 218 ------------------------ linear score: 6014031210590.629 ridge score: 0.26445342202286215 lasso score: 0.26084502559582134 enet score: 0.2627929869810771 knn score: 0.3957768114502442
------------------------ Feature set 11 - no. of features: 216 ------------------------ linear score: 411138844297.12604 ridge score: 0.26683472575592004 lasso score: 0.2631849737445958 enet score: 0.26450451855276536 knn score: 0.39589379952376613
------------------------ Feature set 12 - no. of features: 215 ------------------------ linear score: 431587834963.42645 ridge score: 0.2666558147458217 lasso score: 0.2630660245263232 enet score: 0.2645045190244619 knn score: 0.3957590786676609
------------------------ Feature set 13 - no. of features: 176 ------------------------ linear score: 1212924791685.2466 ridge score: 0.2644165392716574 lasso score: 0.26067963830798635 enet score: 0.263575923004876 knn score: 0.3914108890975717
------------------------ Feature set 14 - no. of features: 174 ------------------------ linear score: 417705088831.77277 ridge score: 0.2653390409940501 lasso score: 0.26152274565576605 enet score: 0.26406817158198576 knn score: 0.39389966567408224
------------------------ Feature set 15 - no. of features: 173 ------------------------ linear score: 132211993180.52454 ridge score: 0.2650709652926926 lasso score: 0.2612527073125805 enet score: 0.26404587878317975 knn score: 0.3942618944234992
------------------------ Feature set 16 - no. of features: 172 ------------------------ linear score: 230888766398.2557 ridge score: 0.26236049385895915 lasso score: 0.25952560864712854 enet score: 0.2631557407022984 knn score: 0.39299770917467525
------------------------ Feature set 17 - no. of features: 170 ------------------------ linear score: 140366949641.5768 ridge score: 0.26530509891998566 lasso score: 0.2621770001634072 enet score: 0.26529637826874225 knn score: 0.39475457887354903
------------------------ Feature set 18 - no. of features: 169 ------------------------ linear score: 2165198617104.5457 ridge score: 0.2650514873440087 lasso score: 0.26203723790609146 enet score: 0.26529637829448216 knn score: 0.3951512470319168
------------------------ Feature set 19 - no. of features: 162 ------------------------ linear score: 80422724882.66663 ridge score: 0.26809924476835645 lasso score: 0.2653067891715481 enet score: 0.2689465940524792 knn score: 0.39092768814033585
------------------------ Feature set 20 - no. of features: 160 ------------------------ linear score: 5473694991.528475 ridge score: 0.27056742878624973 lasso score: 0.2679218168636668 enet score: 0.27075590105396696 knn score: 0.3890902136424989
------------------------ Feature set 21 - no. of features: 159 ------------------------ linear score: 2168472804.370955 ridge score: 0.2704046371669322 lasso score: 0.2676926156299388 enet score: 0.2707559018041787 knn score: 0.3886903936741371
------------------------ Feature set 22 - no. of features: 161 ------------------------ linear score: 350043196622.1782 ridge score: 0.268047533689961 lasso score: 0.2652426682683011 enet score: 0.26894659288519046 knn score: 0.39045354707055274
------------------------ Feature set 23 - no. of features: 159 ------------------------ linear score: 377870080671.39307 ridge score: 0.2705056123940616 lasso score: 0.26789609307792994 enet score: 0.2707559003852241 knn score: 0.38904010677946105
------------------------ Feature set 24 - no. of features: 158 ------------------------ linear score: 3104135667.9395385 ridge score: 0.27033190746442637 lasso score: 0.2676624562972701 enet score: 0.2707559011353997 knn score: 0.3886890778932874
------------------------ Feature set 25 - no. of features: 150 ------------------------ linear score: 1869771438846.8086 ridge score: 0.26699127107367066 lasso score: 0.26434170305054444 enet score: 0.2684597116333315 knn score: 0.38542810130459343
------------------------ Feature set 26 - no. of features: 148 ------------------------ linear score: 288640956544.1784 ridge score: 0.26930416488244674 lasso score: 0.26670505779874404 enet score: 0.2701864691381317 knn score: 0.3839237542484634
------------------------ Feature set 27 - no. of features: 147 ------------------------ linear score: 7310857269.594684 ridge score: 0.26937266445174146 lasso score: 0.2665556724208887 enet score: 0.27018647057534395 knn score: 0.3826870407198087
------------------------ Feature set 28 - no. of features: 149 ------------------------ linear score: 4467819773.616274 ridge score: 0.26695429510153057 lasso score: 0.2642751656029013 enet score: 0.2684597112224402 knn score: 0.3851624052171292
------------------------ Feature set 29 - no. of features: 147 ------------------------ linear score: 585802859093.3098 ridge score: 0.2692660864398662 lasso score: 0.26664026083749115 enet score: 0.2701864686242822 knn score: 0.38263580369836087
------------------------ Feature set 30 - no. of features: 146 ------------------------ linear score: 1572680283.4260325 ridge score: 0.26931472764649894 lasso score: 0.26649799747741243 enet score: 0.270186470063836 knn score: 0.3819727312021332
------------------------ Feature set 31 - no. of features: 104 ------------------------ linear score: 163095382.15619403 ridge score: 0.26602877303788214 lasso score: 0.26542468242976064 enet score: 0.2697087386685089 knn score: 0.37440599311664746
------------------------ Feature set 32 - no. of features: 102 ------------------------ linear score: 599872023.6810648 ridge score: 0.2687921540079789 lasso score: 0.2682186330209429 enet score: 0.2716584176888437 knn score: 0.37402599612564436
------------------------ Feature set 33 - no. of features: 101 ------------------------ linear score: 80443586.3290669 ridge score: 0.2685589700993856 lasso score: 0.26794636642673464 enet score: 0.2716584178807362 knn score: 0.3736507312043177
------------------------ Feature set 34 - no. of features: 103 ------------------------ linear score: 804587389.0954152 ridge score: 0.2659768473020594 lasso score: 0.2653554833518633 enet score: 0.26970873822424213 knn score: 0.3736569523075747
------------------------ Feature set 35 - no. of features: 101 ------------------------ linear score: 49754818.461513534 ridge score: 0.26874520830885185 lasso score: 0.2681429838028002 enet score: 0.27165841556377757 knn score: 0.374118577456923
------------------------ Feature set 36 - no. of features: 100 ------------------------ linear score: 35492572.97698884 ridge score: 0.26850996856037923 lasso score: 0.2678760584744464 enet score: 0.27165841575626504 knn score: 0.3733913247895223
------------------------ Feature set 37 - no. of features: 195 ------------------------ linear score: 429739235561.3691 ridge score: 0.26582108532119153 lasso score: 0.26286739226774747 enet score: 0.2663392378051899 knn score: 0.39450743451754366
------------------------ Feature set 38 - no. of features: 193 ------------------------ linear score: 795093676.5030818 ridge score: 0.2683052546515699 lasso score: 0.2647967412745206 enet score: 0.26801221295669037 knn score: 0.3921137529850729
------------------------ Feature set 39 - no. of features: 192 ------------------------ linear score: 5827312953.500171 ridge score: 0.26796617121662186 lasso score: 0.26463438794290245 enet score: 0.2680122129039808 knn score: 0.3928911982041632
------------------------ Feature set 40 - no. of features: 194 ------------------------ linear score: 4656088304.802507 ridge score: 0.2657817074475646 lasso score: 0.26286738051267705 enet score: 0.2663392376958976 knn score: 0.3942285630484057
------------------------ Feature set 41 - no. of features: 192 ------------------------ linear score: 12542486241.42647 ridge score: 0.26824297351039644 lasso score: 0.26479673925579855 enet score: 0.26801221230653066 knn score: 0.3925841455813262
------------------------ Feature set 42 - no. of features: 191 ------------------------ linear score: 2116023323096.6548 ridge score: 0.2678973168670768 lasso score: 0.2646343856271326 enet score: 0.2680122122535625 knn score: 0.39333980285149195
------------------------ Feature set 43 - no. of features: 179 ------------------------ linear score: 2989526594.9142118 ridge score: 0.26512119542260904 lasso score: 0.26219599831355306 enet score: 0.2658461256500656 knn score: 0.3917015435460742
------------------------ Feature set 44 - no. of features: 177 ------------------------ linear score: 11881414007.267033 ridge score: 0.267428801601722 lasso score: 0.2639433445843537 enet score: 0.2674833631573199 knn score: 0.3904548846979334
------------------------ Feature set 45 - no. of features: 176 ------------------------ linear score: 12011118682.786938 ridge score: 0.26727273996098644 lasso score: 0.26376727036884084 enet score: 0.2674833630491733 knn score: 0.3906785632752654
------------------------ Feature set 46 - no. of features: 178 ------------------------ linear score: 10282738694.902302 ridge score: 0.2650880526042144 lasso score: 0.26219602635497596 enet score: 0.26584612549351416 knn score: 0.39112315736816855
------------------------ Feature set 47 - no. of features: 176 ------------------------ linear score: 726978931700.4084 ridge score: 0.267388159035902 lasso score: 0.26394337473616475 enet score: 0.2674833631518689 knn score: 0.38985558705045065
------------------------ Feature set 48 - no. of features: 175 ------------------------ linear score: 6564802654.082502 ridge score: 0.2672222400548591 lasso score: 0.26376730557454386 enet score: 0.26748336304372233 knn score: 0.3902715112676825
------------------------ Feature set 49 - no. of features: 133 ------------------------ linear score: 5349033906.8525 ridge score: 0.2647390151112522 lasso score: 0.2626580190126668 enet score: 0.26690677621359366 knn score: 0.3865890346398679
------------------------ Feature set 50 - no. of features: 131 ------------------------ linear score: 1594627968.0256827 ridge score: 0.26733232523063544 lasso score: 0.2646159932892151 enet score: 0.2686418036917737 knn score: 0.3861950913364338
------------------------ Feature set 51 - no. of features: 130 ------------------------ linear score: 626584724.5272152 ridge score: 0.26698714276068547 lasso score: 0.26444718978620735 enet score: 0.2686418036461224 knn score: 0.38672962863230265
------------------------ Feature set 52 - no. of features: 132 ------------------------ linear score: 1669488588.5715802 ridge score: 0.2646678683891023 lasso score: 0.26265805242959356 enet score: 0.26690677634611637 knn score: 0.3871048595336449
------------------------ Feature set 53 - no. of features: 130 ------------------------ linear score: 190427441.73221257 ridge score: 0.26726071547735963 lasso score: 0.2646160453888785 enet score: 0.26864180388322934 knn score: 0.3871591867222596
------------------------ Feature set 54 - no. of features: 129 ------------------------ linear score: 1756395290.7639327 ridge score: 0.2669132092463511 lasso score: 0.26444716648301675 enet score: 0.2686418038376138 knn score: 0.3867801958512519
# unsorted DataFrame with the grid search results
gridsearch_result_df = pd.DataFrame(grid_search_result)
gridsearch_result_df.head(3)
| | feature_set | num of features | linear: grid search object | linear: best estimator | linear: best score | linear: best params | ridge: grid search object | ridge: best estimator | ridge: best score | ridge: best params | lasso: grid search object | lasso: best estimator | lasso: best score | lasso: best params | enet: grid search object | enet: best estimator | enet: best score | enet: best params | knn: grid search object | knn: best estimator | knn: best score | knn: best params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | feature_set_0 | 271 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 4.984887e+11 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=162.97508346206... | 0.267259 | {'ridge__alpha': 162.9750834620645} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.263098 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.262586 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.398708 | {'knn__n_neighbors': 9, 'knn__weights': 'dista... |
| 1 | feature_set_1 | 240 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 6.926562e+11 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=123.28467394420... | 0.267226 | {'ridge__alpha': 123.28467394420659} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.263089 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.263639 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.397234 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... |
| 2 | feature_set_2 | 238 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 2.093743e+12 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=123.28467394420... | 0.267509 | {'ridge__alpha': 123.28467394420659} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.263540 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.263815 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.397691 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... |
# normalize gridsearch scores using min-max normalization
gridsearch_result_df_normalized = normalize_values_columns(gridsearch_result_df, ["linear: best score", "ridge: best score", "lasso: best score", "enet: best score", "knn: best score"])
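`normalize_values_columns` is a helper defined earlier in the notebook. Under the assumption that it appends a `"<col> (normalized)"` column per input column via min-max scaling, a minimal equivalent (with an illustrative name, not the notebook's exact implementation) could look like:

```python
# Hypothetical minimal equivalent of the notebook's normalize_values_columns:
# append min-max-rescaled copies of the given columns to the DataFrame.
import pandas as pd

def minmax_normalize_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Append '<col> (normalized)' columns rescaled to [0, 1] via min-max."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        out[f"{col} (normalized)"] = (out[col] - lo) / (hi - lo)
    return out

demo = pd.DataFrame({"score": [2, 4, 6]})
normalized_demo = minmax_normalize_columns(demo, ["score"])
print(normalized_demo["score (normalized)"].tolist())  # → [0.0, 0.5, 1.0]
```

Min-max rescaling puts each estimator's RMSE column on a common [0, 1] scale, so the unregularized linear scores (which blow up into the billions on collinear dummy columns) do not dominate the averaged comparison below.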
# Top feature sets by mean normalized score
gridsearch_result_df_normalized["mean normalized score"] = (gridsearch_result_df_normalized["knn: best score (normalized)"] +
gridsearch_result_df_normalized["ridge: best score (normalized)"] +
gridsearch_result_df_normalized["lasso: best score (normalized)"] +
gridsearch_result_df_normalized["enet: best score (normalized)"] +
gridsearch_result_df_normalized["linear: best score (normalized)"]) / 5
gridsearch_result_df_normalized.sort_values(by = "mean normalized score").head()
| | feature_set | num of features | linear: grid search object | linear: best estimator | linear: best score | linear: best params | ridge: grid search object | ridge: best estimator | ridge: best score | ridge: best params | lasso: grid search object | lasso: best estimator | lasso: best score | lasso: best params | enet: grid search object | enet: best estimator | enet: best score | enet: best params | knn: grid search object | knn: best estimator | knn: best score | knn: best params | linear: best score (normalized) | ridge: best score (normalized) | lasso: best score (normalized) | enet: best score (normalized) | knn: best score (normalized) | mean normalized score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | feature_set_16 | 172 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 2.308888e+11 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=93.260334688321... | 0.262360 | {'ridge__alpha': 93.26033468832199} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.259526 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.263156 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.392998 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 0.038386 | 0.000000 | 0.000000 | 0.062763 | 0.774432 | 0.175116 |
| 13 | feature_set_13 | 176 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 1.212925e+12 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=81.113083078968... | 0.264417 | {'ridge__alpha': 81.11308307896873} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.260680 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.263576 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.391411 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 0.201678 | 0.250525 | 0.132754 | 0.109079 | 0.711755 | 0.281158 |
| 15 | feature_set_15 | 173 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 1.322120e+11 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=81.113083078968... | 0.265071 | {'ridge__alpha': 81.11308307896873} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.261253 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.264046 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.394262 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 0.021978 | 0.330266 | 0.198676 | 0.160882 | 0.824366 | 0.307234 |
| 14 | feature_set_14 | 174 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 4.177051e+11 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=81.113083078968... | 0.265339 | {'ridge__alpha': 81.11308307896873} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.261523 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.264068 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.393900 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 0.069450 | 0.362931 | 0.229740 | 0.163339 | 0.810059 | 0.327104 |
| 49 | feature_set_49 | 133 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 5.349034e+09 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=17.475284000076... | 0.264739 | {'ridge__alpha': 17.475284000076847} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.262658 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.266907 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.386589 | {'knn__n_neighbors': 9, 'knn__weights': 'dista... | 0.000884 | 0.289818 | 0.360336 | 0.476234 | 0.521296 | 0.329714 |
# Top feature sets by lasso score - we use lasso as it seems to be the best performing across different estimators
gridsearch_result_df_normalized.sort_values(by = "lasso: best score").head()
| | feature_set | num of features | linear: grid search object | linear: best estimator | linear: best score | linear: best params | ridge: grid search object | ridge: best estimator | ridge: best score | ridge: best params | lasso: grid search object | lasso: best estimator | lasso: best score | lasso: best params | enet: grid search object | enet: best estimator | enet: best score | enet: best params | knn: grid search object | knn: best estimator | knn: best score | knn: best params | linear: best score (normalized) | ridge: best score (normalized) | lasso: best score (normalized) | enet: best score (normalized) | knn: best score (normalized) | mean normalized score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | feature_set_16 | 172 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 2.308888e+11 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=93.260334688321... | 0.262360 | {'ridge__alpha': 93.26033468832199} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.259526 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.263156 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.392998 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 0.038386 | 0.000000 | 0.000000 | 0.062763 | 0.774432 | 0.175116 |
| 13 | feature_set_13 | 176 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 1.212925e+12 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=81.113083078968... | 0.264417 | {'ridge__alpha': 81.11308307896873} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.260680 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.263576 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.391411 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 0.201678 | 0.250525 | 0.132754 | 0.109079 | 0.711755 | 0.281158 |
| 10 | feature_set_10 | 218 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 6.014031e+12 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=123.28467394420... | 0.264453 | {'ridge__alpha': 123.28467394420659} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.260845 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.262793 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.395777 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 1.000000 | 0.255019 | 0.151779 | 0.022777 | 0.884204 | 0.462756 |
| 15 | feature_set_15 | 173 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 1.322120e+11 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=81.113083078968... | 0.265071 | {'ridge__alpha': 81.11308307896873} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.261253 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.264046 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.394262 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 0.021978 | 0.330266 | 0.198676 | 0.160882 | 0.824366 | 0.307234 |
| 14 | feature_set_14 | 174 | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), LinearRegression()) | 4.177051e+11 | {} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Ridge(alpha=81.113083078968... | 0.265339 | {'ridge__alpha': 81.11308307896873} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), Lasso(alpha=0.0046415888336... | 0.261523 | {'lasso__alpha': 0.004641588833612782} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), ElasticNet(alpha=0.1, l1_ra... | 0.264068 | {'enet__alpha': 0.1, 'enet__l1_ratio': 0.1} | GridSearchCV(cv=5,\n estimator=Pip... | (StandardScaler(), KNeighborsRegressor(n_neigh... | 0.393900 | {'knn__n_neighbors': 11, 'knn__weights': 'dist... | 0.069450 | 0.362931 | 0.229740 | 0.163339 | 0.810059 | 0.327104 |
# DataFrame with just the scores, for a line plot comparing scores across feature sets
gridsearch_result_just_scores = gridsearch_result_df_normalized[["feature_set", "num of features",
"linear: best score",
"ridge: best score",
"lasso: best score",
"enet: best score",
"knn: best score",
"linear: best score (normalized)",
"ridge: best score (normalized)",
"lasso: best score (normalized)",
"enet: best score (normalized)",
"knn: best score (normalized)",
"mean normalized score"]]
gridsearch_result_just_scores.head()
| feature_set | num of features | linear: best score | ridge: best score | lasso: best score | enet: best score | knn: best score | linear: best score (normalized) | ridge: best score (normalized) | lasso: best score (normalized) | enet: best score (normalized) | knn: best score (normalized) | mean normalized score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | feature_set_0 | 271 | 4.984887e+11 | 0.267259 | 0.263098 | 0.262586 | 0.398708 | 0.082882 | 0.596846 | 0.410927 | 0.000000 | 1.000000 | 0.418131 |
| 1 | feature_set_1 | 240 | 6.926562e+11 | 0.267226 | 0.263089 | 0.263639 | 0.397234 | 0.115168 | 0.592910 | 0.409883 | 0.116068 | 0.941778 | 0.435161 |
| 2 | feature_set_2 | 238 | 2.093743e+12 | 0.267509 | 0.263540 | 0.263815 | 0.397691 | 0.348139 | 0.627325 | 0.461780 | 0.135455 | 0.959818 | 0.506503 |
| 3 | feature_set_3 | 237 | 2.220203e+12 | 0.267286 | 0.263287 | 0.263793 | 0.398032 | 0.369167 | 0.600208 | 0.432635 | 0.133051 | 0.973291 | 0.501670 |
| 4 | feature_set_4 | 236 | 7.226254e+11 | 0.265447 | 0.261957 | 0.263356 | 0.397880 | 0.120151 | 0.376029 | 0.279684 | 0.084883 | 0.967279 | 0.365605 |
gridsearch_result_just_scores_long = gridsearch_result_just_scores.drop(["linear: best score",
"linear: best score (normalized)",
"ridge: best score (normalized)",
"lasso: best score (normalized)",
"enet: best score (normalized)",
"knn: best score (normalized)",
"mean normalized score"], axis = 1).melt(id_vars = ["feature_set", "num of features"])
gridsearch_result_just_scores_long.head()
| feature_set | num of features | variable | value | |
|---|---|---|---|---|
| 0 | feature_set_0 | 271 | ridge: best score | 0.267259 |
| 1 | feature_set_1 | 240 | ridge: best score | 0.267226 |
| 2 | feature_set_2 | 238 | ridge: best score | 0.267509 |
| 3 | feature_set_3 | 237 | ridge: best score | 0.267286 |
| 4 | feature_set_4 | 236 | ridge: best score | 0.265447 |
# By feature set index
f, ax = plt.subplots(2, sharex=True, figsize = (24, 14))
sns.lineplot(x="feature_set", y="value", hue = "variable",
data=gridsearch_result_just_scores_long,
palette='colorblind', marker = 'o', markersize = 8, linewidth = 2, ax = ax[0]);
plt.xticks(rotation = 90);
ax[0].set_title("Performance comparison by estimator")
ax[0].set_ylabel("RMSE (Lower the better)")
ax[0].set_xlabel("Feature set")
ax[0].legend(loc = 'best');
sns.barplot(x="feature_set", y="num of features",
data=gridsearch_result_just_scores_long,
color = 'teal', alpha = 0.5, ax = ax[1]);
plt.xticks(rotation = 90);
ax[1].set_ylabel("Number of features")
ax[1].set_xlabel("Feature set")
#ax[1].xaxis.set_label_position('top')
ax[1].invert_yaxis()
plt.tight_layout()
# zoom in to the 3 regularized models
gridsearch_result_just_scores_long = gridsearch_result_just_scores.drop(["linear: best score",
"linear: best score (normalized)",
"ridge: best score (normalized)",
"lasso: best score (normalized)",
"enet: best score (normalized)",
"knn: best score (normalized)",
"knn: best score",
"mean normalized score"], axis = 1).melt(id_vars = ["feature_set", "num of features"])
# By feature set index
f, ax = plt.subplots(2, sharex=True, figsize = (24, 14))
sns.lineplot(x="feature_set", y="value", hue = "variable",
data=gridsearch_result_just_scores_long,
palette='colorblind', marker = 'o', markersize = 8, linewidth = 2, ax = ax[0]);
plt.xticks(rotation = 90);
ax[0].set_title("Performance comparison by estimator")
ax[0].set_ylabel("RMSE (Lower the better)")
ax[0].set_xlabel("Feature set")
ax[0].legend(loc = 'best');
sns.barplot(x="feature_set", y="num of features",
data=gridsearch_result_just_scores_long,
color = 'teal', alpha = 0.5, ax = ax[1]);
plt.xticks(rotation = 90);
ax[1].set_ylabel("Number of features")
ax[1].set_xlabel("Feature set")
#ax[1].xaxis.set_label_position('top')
ax[1].invert_yaxis()
plt.tight_layout()
We see that, of all the estimators, the three with regularization (ridge, lasso, elastic net) performed comparably, with lasso regression holding a slight edge across all feature sets. Plain linear regression was clearly unable to handle this number of features, while the KNN regressor also produced significantly larger errors.
The best feature set turns out to be feature_set_16.
Having identified feature_set_16 as the feature set producing the lowest mean RMSE, we will now tune further by running it through the lasso regressor, which produced the best results across estimators, this time using a finer search space to pick out better hyperparameters.
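The style of comparison used above (StandardScaler pipelines scored by cross-validated RMSE) can be sketched in miniature. This is illustrative only: it uses synthetic data and arbitrary alphas, not the Ames feature sets or the tuned hyperparameters.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide feature set: many features, few of them informative
X, y = make_regression(n_samples=300, n_features=100, n_informative=15,
                       noise=10.0, random_state=42)

scores = {}
for name, est in [("ridge", Ridge(alpha=10.0)),
                  ("lasso", Lasso(alpha=0.1)),
                  ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    pipe = Pipeline([("ss", StandardScaler()), (name, est)])
    # cross_val_score returns negative RMSE; flip the sign for readability
    rmse = -cross_val_score(pipe, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    scores[name] = rmse
print({k: round(v, 2) for k, v in scores.items()})
```

With mostly uninformative features, the sparsity-inducing L1 penalty (lasso, elastic net) tends to shine, mirroring what we observed on the Ames feature sets.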
# We pick feature_set_16 for further tuning
best_feature_set = "feature_set_16" #<------------insert best feature set name
# Different estimators and their parameters
estimators = {
"lasso": {'estimator': Lasso(), 'params': {"lasso__alpha": np.logspace(-4,1,1000)}},
}
"""estimators = {
"enet": {'estimator': ElasticNet(),
"params": {"enet__alpha": np.linspace(0.1, 1, 100), "enet__l1_ratio": np.linspace(0, 1, 20)}},
}"""
# conduct grid search and save results in a dictionary
# loop through the selected estimators and run the pipeline on the best feature set
round2_gridsearch_result = {}
for estimator_name, estimator_info in estimators.items():
    #print(f"{estimator_name}------------------------------------------------------------------")
    #print()
    pipe = Pipeline([
        ("ss", StandardScaler()),
        (estimator_name, estimator_info['estimator'])
    ])
    pipe_params = estimator_info['params']
    pipe_gridsearch = GridSearchCV(pipe,
                                   pipe_params,
                                   cv = 5,
                                   verbose = 1,
                                   scoring = "neg_root_mean_squared_error",
                                   n_jobs = 7)
    pipe_gridsearch.fit(X_trains[best_feature_set], y_trains[best_feature_set])
    print(f"{estimator_name} score:", -pipe_gridsearch.best_score_)
    # add estimator and score into a dictionary
    round2_gridsearch_result["feature_set"] = best_feature_set
    round2_gridsearch_result["num of features"] = X_trains[best_feature_set].shape[1]
    round2_gridsearch_result[f"{estimator_name}: grid search object"] = pipe_gridsearch
    round2_gridsearch_result[f"{estimator_name}: best estimator"] = pipe_gridsearch.best_estimator_
    round2_gridsearch_result[f"{estimator_name}: best score"] = -(pipe_gridsearch.best_score_)
    round2_gridsearch_result[f"{estimator_name}: best params"] = pipe_gridsearch.best_params_
Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
lasso score: 0.25951780177444805
best_alpha = round2_gridsearch_result['lasso: best params']['lasso__alpha']
ss = StandardScaler()
X_train_best_features_scaled = ss.fit_transform(X_trains[best_feature_set])
X_train_best_features_scaled_sm = sm.add_constant(X_train_best_features_scaled)
y_best_features = y_trains[best_feature_set]
model = sm.OLS(y_best_features, X_train_best_features_scaled_sm)
results_fu = model.fit()
results_fr = model.fit_regularized(L1_wt=1.0, alpha=best_alpha, start_params=results_fu.params, profile_scale = False)
final = sm.regression.linear_model.OLSResults(model,
results_fr.params,
model.normalized_cov_params)
best_feature_lasso_summary_df = pd.DataFrame(list(X_trains[best_feature_set].columns), columns = ["Feature"])
best_feature_lasso_summary_df["P-values"] = final.pvalues[1:]
best_feature_lasso_summary_df["Coefficients"] = final.params.iloc[1:].values
best_feature_lasso_summary_df[best_feature_lasso_summary_df["P-values"] <= 0.05].sort_values(by = "Coefficients")
| Feature | P-values | Coefficients | |
|---|---|---|---|
| 40 | Age Since Built | 7.001457e-08 | -0.121110 |
| 140 | Heating_Grav | 3.688317e-04 | -0.019417 |
| 32 | Kitchen AbvGr | 1.359346e-02 | -0.017650 |
| 87 | Condition 1_PosN | 1.617536e-02 | 0.016892 |
| 3 | Exter Qual | 1.918609e-02 | 0.024849 |
| 107 | Roof Matl_Membran | 4.229414e-03 | 0.025091 |
| 74 | Neighborhood_NoRidge | 2.771515e-02 | 0.025193 |
| 81 | Neighborhood_StoneBr | 1.379371e-02 | 0.025397 |
| 7 | Heating QC | 7.530153e-04 | 0.026284 |
| 66 | Neighborhood_GrnHill | 6.684723e-06 | 0.027299 |
| 157 | Paved Drive_Y | 9.179885e-04 | 0.027338 |
| 39 | Screen Porch | 1.446207e-06 | 0.028663 |
| 85 | Condition 1_Norm | 1.142005e-02 | 0.029105 |
| 122 | Bsmt Exposure_Gd | 1.077370e-04 | 0.029867 |
| 35 | Garage Area | 5.626792e-03 | 0.030775 |
| 9 | Kitchen Qual | 2.620128e-04 | 0.034278 |
| 33 | Fireplaces | 2.306414e-02 | 0.035722 |
| 160 | Misc Feature_Othr | 1.472714e-03 | 0.037227 |
| 158 | Misc Feature_Gar2 | 2.544097e-03 | 0.040548 |
| 111 | Roof Matl_WdShake | 3.500266e-03 | 0.040845 |
| 75 | Neighborhood_NridgHt | 8.701349e-03 | 0.042713 |
| 10 | Functional | 6.463639e-15 | 0.050361 |
| 112 | Roof Matl_WdShngl | 7.291189e-05 | 0.061059 |
| 23 | BsmtFin SF 1 | 9.359692e-06 | 0.063310 |
| 110 | Roof Matl_Tar&Grv | 1.084747e-03 | 0.083116 |
| 21 | Lot Area | 7.218486e-16 | 0.084233 |
| 2 | Overall Cond | 9.364717e-30 | 0.095851 |
| 106 | Roof Matl_CompShg | 1.551090e-04 | 0.122275 |
| 161 | Misc Feature_Shed | 1.211425e-03 | 0.143793 |
| 1 | Overall Qual | 2.739657e-37 | 0.162288 |
| 159 | Misc Feature_NA | 5.201462e-04 | 0.164299 |
| 27 | Gr Liv Area | 1.453096e-03 | 0.181724 |
best_feature_lasso_summary_df.sort_values(by = "Coefficients")
| Feature | P-values | Coefficients | |
|---|---|---|---|
| 40 | Age Since Built | 7.001457e-08 | -0.121110 |
| 63 | Neighborhood_Edwards | 1.524208e-01 | -0.028253 |
| 140 | Heating_Grav | 3.688317e-04 | -0.019417 |
| 32 | Kitchen AbvGr | 1.359346e-02 | -0.017650 |
| 14 | Age Since Remod/Add | 8.679654e-02 | -0.017512 |
| ... | ... | ... | ... |
| 106 | Roof Matl_CompShg | 1.551090e-04 | 0.122275 |
| 161 | Misc Feature_Shed | 1.211425e-03 | 0.143793 |
| 1 | Overall Qual | 2.739657e-37 | 0.162288 |
| 159 | Misc Feature_NA | 5.201462e-04 | 0.164299 |
| 27 | Gr Liv Area | 1.453096e-03 | 0.181724 |
172 rows × 3 columns
# Using our best feature set and best estimator, we will predict the SalePrice of our training set
ss = StandardScaler()
X_train_best_features_scaled = ss.fit_transform(X_trains[best_feature_set])
prediction_train = round2_gridsearch_result["lasso: best estimator"][1].predict(X_train_best_features_scaled)
# Scale back using inverse boxcox
prediction_train = special.inv_boxcox1p(prediction_train, sale_price_lambda[0]) # prediction
actual_train = special.inv_boxcox1p(y_trains[best_feature_set], sale_price_lambda[0]) # actual
residual = actual_train - prediction_train
# plot a joint plot
j_plot = sns.jointplot(x = prediction_train, y = actual_train, height=10, ratio=7, marginal_ticks=True,
xlim=(0, 650000), ylim=(0, 650000),)
j_plot.fig.suptitle("Actual SalePrice vs Predicted SalePrice (Training Set)", fontsize=20)
j_plot.set_axis_labels("Predicted Sale Price", "Actual Sale Price", fontsize=14)
x0, x1 = j_plot.ax_joint.get_xlim()
y0, y1 = j_plot.ax_joint.get_ylim()
lims = [max(x0, y0), min(x1, y1)]
j_plot.ax_joint.plot(lims, lims, 'teal')
j_plot.fig.tight_layout();
fig, ax = plt.subplots(figsize = (10, 10))
sns.scatterplot(x = prediction_train, y = residual, ax = ax)
plt.axhline(0, color = 'teal')
plt.xlabel("Predicted Sale Price")
plt.ylabel("Residual Value (Actual Price - Predicted Price)")
plt.title("Residual vs Predicted SalePrice (Training Set)");
fig, ax = plt.subplots(figsize = (10, 10))
sns.scatterplot(x = prediction_train, y = actual_train, ax = ax)
#ax.plot([600000, 600000], [0, 0], 'red', linewidth=10)
sns.lineplot(y = [650000, 0], x = [650000, 0], ax = ax, color = 'teal')
plt.xlim(0, 650000)
plt.ylim(0, 650000)
plt.xlabel("Predicted Sale Price")
plt.ylabel("Actual Sale Price")
plt.title("Actual SalePrice vs Predicted SalePrice (Training Set)");
We see that our points are spread fairly evenly around the line of perfect prediction. At the higher end of SalePrice, the model does tend to underpredict for some points, while for two points in particular it overpredicted by a large margin. This indicates the model still has room to improve when predicting higher-priced properties. That said, most sale prices do not fall in the higher ranges, so the model does a reasonably good job overall.
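The residual checks above can also be quantified with summary metrics. A minimal sketch on synthetic predicted/actual arrays (hypothetical data, not the notebook's variables): an unbiased model has residuals centred near zero, RMSE close to the noise level, and a high R².

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical predicted vs actual sale prices with unbiased errors
rng = np.random.default_rng(0)
actual = rng.uniform(50_000, 650_000, size=200)
predicted = actual + rng.normal(0, 20_000, size=200)

residual = actual - predicted
rmse = np.sqrt(mean_squared_error(actual, predicted))
r2 = r2_score(actual, predicted)

# Residual mean near zero relative to RMSE suggests no systematic bias
print(round(residual.mean()), round(rmse), round(r2, 3))
```

A drift of the residual mean away from zero in a particular price band (as we see at the high end of the Ames predictions) is exactly what the residual scatter plot is designed to reveal.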
Now let us review our best feature set. It consists of 174 features (after dummy encoding) and was the result of dropping the following features.
feature_sets_to_drop_dict[best_feature_set]
['Has Pool', 'Pool Area', 'Low Qual Fin SF', 'Garage Cars', 'Bsmt Unf SF', 'Has Screen Porch', 'Bsmt Half Bath', 'Yr Sold', 'Year Built', 'Has 3Ssn Porch', 'Misc Val', 'Alley', 'Exterior 2nd', 'Exterior 1st', 'BsmtFin SF 2', '3Ssn Porch', 'MS SubClass', 'Bldg Type', 'Lot Config', 'Has 2nd Flr', 'Mo Sold', 'Pool QC', 'Lot Shape', 'TotRms AbvGrd', 'Roof Style', 'Fence', 'Year Remod/Add']
We will now interpret the coefficients resulting from our two rounds of modelling and identify the most important features influencing SalePrice.
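Because all features were standardized before fitting, the lasso coefficient magnitudes are directly comparable and can be ranked as importances. A self-contained sketch of this idea on a public dataset (sklearn's diabetes set, not the Ames data):

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Fit lasso on standardized features, then rank coefficients
X, y = load_diabetes(return_X_y=True, as_frame=True)
coefs = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), y).coef_

importance = (pd.DataFrame({"Features": X.columns, "Coefficients": coefs})
                .sort_values("Coefficients", ascending=False))
print(importance.head(3))  # largest positive drivers of the target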
# Get coefficients
best_estimator_coef = pd.Series(round2_gridsearch_result["lasso: best estimator"][1].coef_, name = "Coefficients")
best_estimator_features = pd.Series(X_trains[best_feature_set].columns, name = "Features")
results_df = pd.DataFrame([best_estimator_coef, best_estimator_features]).T.sort_values(by = "Coefficients", ascending=False)
results_df.head()
| Coefficients | Features | |
|---|---|---|
| 27 | 0.194671 | Gr Liv Area |
| 1 | 0.155336 | Overall Qual |
| 2 | 0.0940919 | Overall Cond |
| 21 | 0.0892683 | Lot Area |
| 42 | 0.0722996 | Gross Flr Area |
results_df.tail(15)
| Coefficients | Features | |
|---|---|---|
| 88 | -0.00462558 | Condition 1_RRAe |
| 53 | -0.00487087 | Utilities_NoSeWa |
| 150 | -0.00643505 | Garage Type_CarPort |
| 142 | -0.00953385 | Heating_Wall |
| 125 | -0.012751 | Bsmt Exposure_No |
| 69 | -0.0128542 | Neighborhood_MeadowV |
| 14 | -0.0130637 | Age Since Remod/Add |
| 67 | -0.0146313 | Neighborhood_IDOTRR |
| 131 | -0.0165561 | BsmtFin Type 1_Unf |
| 32 | -0.0181457 | Kitchen AbvGr |
| 76 | -0.0196031 | Neighborhood_OldTown |
| 140 | -0.0204031 | Heating_Grav |
| 43 | -0.0253496 | MS Zoning_C (all) |
| 63 | -0.0334698 | Neighborhood_Edwards |
| 40 | -0.107788 | Age Since Built |
# all features
def positive_or_negative(row):
    if row["Coefficients"] > 0:
        row["Impact"] = "Positive Impact"
    elif row["Coefficients"] == 0:
        row["Impact"] = "Little Impact"
    else:
        row["Impact"] = "Negative Impact"
    return row
results_df_impact = results_df.apply(positive_or_negative, axis = 1)
plt.figure(figsize = (24, 8))
sns.barplot(data = results_df_impact, x = "Features", y = "Coefficients",
hue = "Impact", palette='colorblind')
plt.xticks(rotation = 90)
plt.title("Top most important factors influencing housing price in Ames");
Here, we can easily visualize the most important factors affecting sale price, grouped by positive and negative impact.
A good guide on location would be a ranking of neighborhoods by their potential impact on sale price.
results_df_nbh = results_df[results_df['Features'].str.contains("Neighborhood")].sort_values(by = "Coefficients", ascending = False)
results_df_impact_nbh = results_df_nbh.apply(positive_or_negative, axis = 1)
results_df_impact_nbh.head()
| Coefficients | Features | Impact | |
|---|---|---|---|
| 75 | 0.050353 | Neighborhood_NridgHt | Positive Impact |
| 80 | 0.029731 | Neighborhood_Somerst | Positive Impact |
| 74 | 0.029716 | Neighborhood_NoRidge | Positive Impact |
| 81 | 0.028627 | Neighborhood_StoneBr | Positive Impact |
| 66 | 0.026088 | Neighborhood_GrnHill | Positive Impact |
plt.figure(figsize = (24, 8))
sns.barplot(data = results_df_impact_nbh, x = "Features", y = "Coefficients",
hue = "Impact", palette='colorblind')
plt.axhline(0, color = 'grey')
plt.xticks(rotation = 90)
plt.title("Neighborhood Desirability Ranking");
We see that the most important features influencing sale price fall into a few categories:

Features with a positive influence on price:
- Area: Gr Liv Area > Lot Area > Gross Flr Area > BsmtFin SF 1 > 1st Flr SF.
- Quality: Overall Qual, Overall Cond, Functional, Kitchen Qual and Basement Qual may also provide valuable information for judging sale price. This points to the value of hiring a professional property valuer before making any investment decisions.
- Neighborhoods: Northridge Heights, Somerset, Northridge, Stone Brook, Green Hills and ``.
- Proximity
- Roof material

Features with a negative influence on price:
- House Age: the Age Since Built feature; Age Since Remod/Add also has some negative influence on price.
- Neighborhoods: Edwards, Old Town, Iowa DOT and Rail Road, and Meadow Village.
- Zoning
- Heating Type: Gravity Furnace and Wall Furnace generally do not fetch a higher price, perhaps because these are more traditional heating technologies.
- Number of Kitchens: more kitchens above grade is associated with a lower price.
- Basement: an unfinished basement generally fetches a lower price, as do basements without exposure.
- Garage type: carport garages, i.e. sheltered garages attached to a property.
- Utilities: no sewage.
- Proximity
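One caveat when reading these coefficients: they act on the Box-Cox-transformed SalePrice, so the dollar effect of a one-standard-deviation feature change grows with the price level. A hedged sketch with a hypothetical lambda and coefficient (stand-ins for the notebook's sale_price_lambda and a lasso coefficient such as Gr Liv Area's):

```python
from scipy import special

lam = 0.1      # hypothetical Box-Cox lambda
coef = 0.18    # hypothetical standardized lasso coefficient

def dollar_gain(price):
    # move one coefficient-step in transformed space, then map back to dollars
    transformed = special.boxcox1p(price, lam)
    return special.inv_boxcox1p(transformed + coef, lam) - price

print(round(dollar_gain(120_000)), round(dollar_gain(300_000)))
```

The same feature improvement translates into a larger dollar gain on a pricier house, which is worth keeping in mind when advising owners on which upgrades are worthwhile.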
# scaling of test set
X_test_scaled = round2_gridsearch_result["lasso: best estimator"][0].transform(X_tests[best_feature_set])
# predict
prediction = round2_gridsearch_result["lasso: best estimator"][1].predict(X_test_scaled)
# reverse box cox
prediction = special.inv_boxcox1p(prediction, sale_price_lambda[0])
#prediction
prediction_proper = pd.concat([df_original_test["Id"].reset_index(), pd.Series(prediction, name = "SalePrice")], axis = 1)
prediction_proper = prediction_proper.drop("index", axis = 1)
prediction_proper
| Id | SalePrice | |
|---|---|---|
| 0 | 2 | 120264.386188 |
| 1 | 4 | 241315.048322 |
| 2 | 6 | 195688.564390 |
| 3 | 7 | 221104.334691 |
| 4 | 17 | 211648.787119 |
| ... | ... | ... |
| 873 | 2919 | 79857.676810 |
| 874 | 2921 | 84540.509176 |
| 875 | 2922 | 185818.977736 |
| 876 | 2925 | 175967.579891 |
| 877 | 2928 | 127702.805688 |
878 rows × 2 columns
# export prediction
#prediction_proper.to_csv("../datasets/submission.csv", index = None)
Kaggle Score: